Thera Bank recently saw a steep decline in the number of credit card users. Credit cards are a good source of income for banks because of the different fees they charge, such as annual fees, balance transfer fees, cash advance fees, late payment fees, and foreign transaction fees. Some fees are charged to every user irrespective of usage, while others are charged only under specified circumstances.
Customers leaving the credit card service would lead to losses for the bank, so the bank wants to analyze customer data to identify customers who will leave the service and the reasons why, so that it can improve in those areas.
As a data scientist at Thera Bank, you need to come up with a classification model that will help the bank improve its services so that customers do not renounce their credit cards.
You need to identify the best possible model that will give the required performance.
Explore and visualize the dataset.
Build a classification model to predict whether or not a customer will churn
Optimize the model using appropriate techniques
Generate a set of insights and recommendations that will help the bank
CLIENTNUM: Client number. Unique identifier for the customer holding the account
Attrition_Flag: Internal event (customer activity) variable - if the account is closed then "Attrited Customer" else "Existing Customer"
Customer_Age: Age in Years
Gender: Gender of the account holder
Dependent_count: Number of dependents
Education_Level: Educational Qualification of the account holder - Graduate, High School, Unknown, Uneducated, College (refers to a college student), Post-Graduate, Doctorate.
Marital_Status: Marital Status of the account holder
Income_Category: Annual Income Category of the account holder
Card_Category: Type of Card
Months_on_book: Period of relationship with the bank
Total_Relationship_Count: Total no. of products held by the customer
Months_Inactive_12_mon: No. of months inactive in the last 12 months
Contacts_Count_12_mon: No. of Contacts between the customer and bank in the last 12 months
Credit_Limit: Credit Limit on the Credit Card
Total_Revolving_Bal: Total revolving balance on the credit card (the balance that carries over from one month to the next)
Avg_Open_To_Buy: Open to Buy refers to the amount left on the credit card to use (Average of last 12 months)
Total_Trans_Amt: Total Transaction Amount (Last 12 months)
Total_Trans_Ct: Total Transaction Count (Last 12 months)
Total_Ct_Chng_Q4_Q1: Ratio of the total transaction count in the 4th quarter to the total transaction count in the 1st quarter
Total_Amt_Chng_Q4_Q1: Ratio of the total transaction amount in the 4th quarter to the total transaction amount in the 1st quarter
Avg_Utilization_Ratio: Represents how much of the available credit the customer spent
# To help with reading and manipulating data
import pandas as pd
import numpy as np
# To help with data visualization
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
# To be used for missing value imputation
from sklearn.impute import SimpleImputer
# To help with model building
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (
    AdaBoostClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
    BaggingClassifier,
)
from xgboost import XGBClassifier
# To get different metric scores, and split data
from sklearn import metrics
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.metrics import (
    f1_score,
    accuracy_score,
    recall_score,
    precision_score,
    confusion_matrix,
    roc_auc_score,
    plot_confusion_matrix,  # note: removed in scikit-learn 1.2; use ConfusionMatrixDisplay on newer versions
)
# To be used for data scaling and one hot encoding
from sklearn.preprocessing import StandardScaler, MinMaxScaler, OneHotEncoder
# To be used for tuning the model
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
# To be used for creating pipelines and personalizing them
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
# To oversample and undersample data
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
# To define maximum number of columns to be displayed in a dataframe
pd.set_option("display.max_columns", None)
# To suppress scientific notation in dataframes
pd.set_option("display.float_format", lambda x: "%.3f" % x)
# To suppress warnings
import warnings
warnings.filterwarnings("ignore")
# This will help in making the Python code more structured automatically (good coding practice)
%load_ext nb_black
# Display up to 200 rows when printing a dataframe
# (display.max_columns was already set to None above)
# pd.set_option('display.max_rows', None)
pd.set_option("display.max_rows", 200)
# Loading the dataset.
df = pd.read_csv("BankChurners.csv")
# Observing the first five rows of the dataset.
df.head()
| CLIENTNUM | Attrition_Flag | Customer_Age | Gender | Dependent_count | Education_Level | Marital_Status | Income_Category | Card_Category | Months_on_book | Total_Relationship_Count | Months_Inactive_12_mon | Contacts_Count_12_mon | Credit_Limit | Total_Revolving_Bal | Avg_Open_To_Buy | Total_Amt_Chng_Q4_Q1 | Total_Trans_Amt | Total_Trans_Ct | Total_Ct_Chng_Q4_Q1 | Avg_Utilization_Ratio | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 768805383 | Existing Customer | 45 | M | 3 | High School | Married | $60K - $80K | Blue | 39 | 5 | 1 | 3 | 12691.000 | 777 | 11914.000 | 1.335 | 1144 | 42 | 1.625 | 0.061 |
| 1 | 818770008 | Existing Customer | 49 | F | 5 | Graduate | Single | Less than $40K | Blue | 44 | 6 | 1 | 2 | 8256.000 | 864 | 7392.000 | 1.541 | 1291 | 33 | 3.714 | 0.105 |
| 2 | 713982108 | Existing Customer | 51 | M | 3 | Graduate | Married | $80K - $120K | Blue | 36 | 4 | 1 | 0 | 3418.000 | 0 | 3418.000 | 2.594 | 1887 | 20 | 2.333 | 0.000 |
| 3 | 769911858 | Existing Customer | 40 | F | 4 | High School | NaN | Less than $40K | Blue | 34 | 3 | 4 | 1 | 3313.000 | 2517 | 796.000 | 1.405 | 1171 | 20 | 2.333 | 0.760 |
| 4 | 709106358 | Existing Customer | 40 | M | 3 | Uneducated | Married | $60K - $80K | Blue | 21 | 5 | 1 | 0 | 4716.000 | 0 | 4716.000 | 2.175 | 816 | 28 | 2.500 | 0.000 |
# Observing the last five rows of the dataset.
df.tail()
| CLIENTNUM | Attrition_Flag | Customer_Age | Gender | Dependent_count | Education_Level | Marital_Status | Income_Category | Card_Category | Months_on_book | Total_Relationship_Count | Months_Inactive_12_mon | Contacts_Count_12_mon | Credit_Limit | Total_Revolving_Bal | Avg_Open_To_Buy | Total_Amt_Chng_Q4_Q1 | Total_Trans_Amt | Total_Trans_Ct | Total_Ct_Chng_Q4_Q1 | Avg_Utilization_Ratio | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 10122 | 772366833 | Existing Customer | 50 | M | 2 | Graduate | Single | $40K - $60K | Blue | 40 | 3 | 2 | 3 | 4003.000 | 1851 | 2152.000 | 0.703 | 15476 | 117 | 0.857 | 0.462 |
| 10123 | 710638233 | Attrited Customer | 41 | M | 2 | NaN | Divorced | $40K - $60K | Blue | 25 | 4 | 2 | 3 | 4277.000 | 2186 | 2091.000 | 0.804 | 8764 | 69 | 0.683 | 0.511 |
| 10124 | 716506083 | Attrited Customer | 44 | F | 1 | High School | Married | Less than $40K | Blue | 36 | 5 | 3 | 4 | 5409.000 | 0 | 5409.000 | 0.819 | 10291 | 60 | 0.818 | 0.000 |
| 10125 | 717406983 | Attrited Customer | 30 | M | 2 | Graduate | NaN | $40K - $60K | Blue | 36 | 4 | 3 | 3 | 5281.000 | 0 | 5281.000 | 0.535 | 8395 | 62 | 0.722 | 0.000 |
| 10126 | 714337233 | Attrited Customer | 43 | F | 2 | Graduate | Married | Less than $40K | Silver | 25 | 6 | 2 | 4 | 10388.000 | 1961 | 8427.000 | 0.703 | 10294 | 61 | 0.649 | 0.189 |
# Making a copy of the data set to preserve integrity
bk = df.copy()
print(f"There are {bk.shape[0]} rows and {bk.shape[1]} columns.") # f-string
# Viewing random rows
np.random.seed(1)
bk.sample(n=10)
There are 10127 rows and 21 columns.
| CLIENTNUM | Attrition_Flag | Customer_Age | Gender | Dependent_count | Education_Level | Marital_Status | Income_Category | Card_Category | Months_on_book | Total_Relationship_Count | Months_Inactive_12_mon | Contacts_Count_12_mon | Credit_Limit | Total_Revolving_Bal | Avg_Open_To_Buy | Total_Amt_Chng_Q4_Q1 | Total_Trans_Amt | Total_Trans_Ct | Total_Ct_Chng_Q4_Q1 | Avg_Utilization_Ratio | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 6498 | 712389108 | Existing Customer | 43 | F | 2 | Graduate | Married | Less than $40K | Blue | 36 | 6 | 3 | 2 | 2570.000 | 2107 | 463.000 | 0.651 | 4058 | 83 | 0.766 | 0.820 |
| 9013 | 718388733 | Existing Customer | 38 | F | 1 | College | NaN | Less than $40K | Blue | 32 | 2 | 3 | 3 | 2609.000 | 1259 | 1350.000 | 0.871 | 8677 | 96 | 0.627 | 0.483 |
| 2053 | 710109633 | Existing Customer | 39 | M | 2 | College | Married | $60K - $80K | Blue | 31 | 6 | 3 | 2 | 9871.000 | 1061 | 8810.000 | 0.545 | 1683 | 34 | 0.478 | 0.107 |
| 3211 | 717331758 | Existing Customer | 44 | M | 4 | Graduate | Married | $120K + | Blue | 32 | 6 | 3 | 4 | 34516.000 | 2517 | 31999.000 | 0.765 | 4228 | 83 | 0.596 | 0.073 |
| 5559 | 709460883 | Attrited Customer | 38 | F | 2 | Doctorate | Married | Less than $40K | Blue | 28 | 5 | 2 | 4 | 1614.000 | 0 | 1614.000 | 0.609 | 2437 | 46 | 0.438 | 0.000 |
| 6106 | 789105183 | Existing Customer | 54 | M | 3 | Post-Graduate | Single | $80K - $120K | Silver | 42 | 3 | 1 | 2 | 34516.000 | 2488 | 32028.000 | 0.552 | 4401 | 87 | 0.776 | 0.072 |
| 4150 | 771342183 | Attrited Customer | 53 | F | 3 | Graduate | Single | $40K - $60K | Blue | 40 | 6 | 3 | 2 | 1625.000 | 0 | 1625.000 | 0.689 | 2314 | 43 | 0.433 | 0.000 |
| 2205 | 708174708 | Existing Customer | 38 | M | 4 | Graduate | Married | $40K - $60K | Blue | 27 | 6 | 2 | 4 | 5535.000 | 1276 | 4259.000 | 0.636 | 1764 | 38 | 0.900 | 0.231 |
| 4145 | 718076733 | Existing Customer | 43 | M | 1 | Graduate | Single | $60K - $80K | Silver | 31 | 4 | 3 | 3 | 25824.000 | 1170 | 24654.000 | 0.684 | 3101 | 73 | 0.780 | 0.045 |
| 5324 | 821889858 | Attrited Customer | 50 | F | 1 | Doctorate | Single | abc | Blue | 46 | 6 | 4 | 3 | 1970.000 | 1477 | 493.000 | 0.662 | 2493 | 44 | 0.571 | 0.750 |
# Observing the range of values.
bk.describe()
| CLIENTNUM | Customer_Age | Dependent_count | Months_on_book | Total_Relationship_Count | Months_Inactive_12_mon | Contacts_Count_12_mon | Credit_Limit | Total_Revolving_Bal | Avg_Open_To_Buy | Total_Amt_Chng_Q4_Q1 | Total_Trans_Amt | Total_Trans_Ct | Total_Ct_Chng_Q4_Q1 | Avg_Utilization_Ratio | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 10127.000 | 10127.000 | 10127.000 | 10127.000 | 10127.000 | 10127.000 | 10127.000 | 10127.000 | 10127.000 | 10127.000 | 10127.000 | 10127.000 | 10127.000 | 10127.000 | 10127.000 |
| mean | 739177606.334 | 46.326 | 2.346 | 35.928 | 3.813 | 2.341 | 2.455 | 8631.954 | 1162.814 | 7469.140 | 0.760 | 4404.086 | 64.859 | 0.712 | 0.275 |
| std | 36903783.450 | 8.017 | 1.299 | 7.986 | 1.554 | 1.011 | 1.106 | 9088.777 | 814.987 | 9090.685 | 0.219 | 3397.129 | 23.473 | 0.238 | 0.276 |
| min | 708082083.000 | 26.000 | 0.000 | 13.000 | 1.000 | 0.000 | 0.000 | 1438.300 | 0.000 | 3.000 | 0.000 | 510.000 | 10.000 | 0.000 | 0.000 |
| 25% | 713036770.500 | 41.000 | 1.000 | 31.000 | 3.000 | 2.000 | 2.000 | 2555.000 | 359.000 | 1324.500 | 0.631 | 2155.500 | 45.000 | 0.582 | 0.023 |
| 50% | 717926358.000 | 46.000 | 2.000 | 36.000 | 4.000 | 2.000 | 2.000 | 4549.000 | 1276.000 | 3474.000 | 0.736 | 3899.000 | 67.000 | 0.702 | 0.176 |
| 75% | 773143533.000 | 52.000 | 3.000 | 40.000 | 5.000 | 3.000 | 3.000 | 11067.500 | 1784.000 | 9859.000 | 0.859 | 4741.000 | 81.000 | 0.818 | 0.503 |
| max | 828343083.000 | 73.000 | 5.000 | 56.000 | 6.000 | 6.000 | 6.000 | 34516.000 | 2517.000 | 34516.000 | 3.397 | 18484.000 | 139.000 | 3.714 | 0.999 |
# Looking at the number of open and closed accounts.
bk["Attrition_Flag"].value_counts()
Existing Customer    8500
Attrited Customer    1627
Name: Attrition_Flag, dtype: int64
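The class split above is heavily imbalanced. A quick way to quantify it, sketched here with the counts reproduced from the output above rather than the full dataset:

```python
import pandas as pd

# Class counts as reported by value_counts() above
counts = pd.Series({"Existing Customer": 8500, "Attrited Customer": 1627})

# Share of each class; churners are the minority class
share = counts / counts.sum()
churn_rate = share["Attrited Customer"]
print(f"Churn rate: {churn_rate:.1%}")  # Churn rate: 16.1%
```

On the real data the same figure comes from `bk["Attrition_Flag"].value_counts(normalize=True)`; an imbalance of this size is why SMOTE and RandomUnderSampler were imported earlier.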
# Checking for missing values and variable types.
bk.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10127 entries, 0 to 10126
Data columns (total 21 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   CLIENTNUM                 10127 non-null  int64
 1   Attrition_Flag            10127 non-null  object
 2   Customer_Age              10127 non-null  int64
 3   Gender                    10127 non-null  object
 4   Dependent_count           10127 non-null  int64
 5   Education_Level           8608 non-null   object
 6   Marital_Status            9378 non-null   object
 7   Income_Category           10127 non-null  object
 8   Card_Category             10127 non-null  object
 9   Months_on_book            10127 non-null  int64
 10  Total_Relationship_Count  10127 non-null  int64
 11  Months_Inactive_12_mon    10127 non-null  int64
 12  Contacts_Count_12_mon     10127 non-null  int64
 13  Credit_Limit              10127 non-null  float64
 14  Total_Revolving_Bal       10127 non-null  int64
 15  Avg_Open_To_Buy           10127 non-null  float64
 16  Total_Amt_Chng_Q4_Q1      10127 non-null  float64
 17  Total_Trans_Amt           10127 non-null  int64
 18  Total_Trans_Ct            10127 non-null  int64
 19  Total_Ct_Chng_Q4_Q1       10127 non-null  float64
 20  Avg_Utilization_Ratio     10127 non-null  float64
dtypes: float64(5), int64(10), object(6)
memory usage: 1.6+ MB
# Missing values by column.
bk.isnull().sum().sort_values(ascending=False)
Education_Level             1519
Marital_Status               749
CLIENTNUM                      0
Contacts_Count_12_mon          0
Total_Ct_Chng_Q4_Q1            0
Total_Trans_Ct                 0
Total_Trans_Amt                0
Total_Amt_Chng_Q4_Q1           0
Avg_Open_To_Buy                0
Total_Revolving_Bal            0
Credit_Limit                   0
Total_Relationship_Count       0
Months_Inactive_12_mon         0
Attrition_Flag                 0
Months_on_book                 0
Card_Category                  0
Income_Category                0
Dependent_count                0
Gender                         0
Customer_Age                   0
Avg_Utilization_Ratio          0
dtype: int64
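Only Education_Level and Marital_Status have missing values. The SimpleImputer imported earlier supports mode imputation for such categorical columns; a minimal sketch on a toy frame, where only the column names are taken from the dataset:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame mimicking the two columns with missing values
toy = pd.DataFrame(
    {
        "Education_Level": ["Graduate", np.nan, "Graduate", "High School"],
        "Marital_Status": ["Married", "Single", np.nan, "Married"],
    }
)

# Mode (most-frequent) imputation for categorical columns
imputer = SimpleImputer(strategy="most_frequent")
filled = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)
print(filled.isnull().sum().sum())  # 0
```

Whether to impute or keep an explicit "Unknown" level is a modeling choice; running the imputer inside a Pipeline avoids leaking test-set statistics.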
# Dropping CLIENTNUM as it is a unique identifier with no predictive value
bk.drop(["CLIENTNUM"], axis=1, inplace=True)
# Printing the count of each unique value in each categorical column
cat_cols = [
    "Attrition_Flag",
    "Gender",
    "Education_Level",
    "Marital_Status",
    "Income_Category",
    "Card_Category",
]
for column in cat_cols:
    print(bk[column].value_counts())
    print("-" * 40)
Existing Customer    8500
Attrited Customer    1627
Name: Attrition_Flag, dtype: int64
----------------------------------------
F    5358
M    4769
Name: Gender, dtype: int64
----------------------------------------
Graduate         3128
High School      2013
Uneducated       1487
College          1013
Post-Graduate     516
Doctorate         451
Name: Education_Level, dtype: int64
----------------------------------------
Married     4687
Single      3943
Divorced     748
Name: Marital_Status, dtype: int64
----------------------------------------
Less than $40K    3561
$40K - $60K       1790
$80K - $120K      1535
$60K - $80K       1402
abc               1112
$120K +            727
Name: Income_Category, dtype: int64
----------------------------------------
Blue        9436
Silver       555
Gold         116
Platinum      20
Name: Card_Category, dtype: int64
----------------------------------------
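The Income_Category counts above include an anomalous level, "abc" (1112 rows), which looks like a placeholder for an unknown income bracket. One way to handle it, sketched on a toy series, is to recode it as missing so it can be treated alongside the other missing categoricals; treating "abc" this way is an assumption, not something documented in the data dictionary:

```python
import numpy as np
import pandas as pd

# Toy series containing the anomalous "abc" level seen in the output above
income = pd.Series(["Less than $40K", "abc", "$120K +", "abc", "$40K - $60K"])

# Recode the garbage level as missing
cleaned = income.replace("abc", np.nan)
print(cleaned.isnull().sum())  # 2
```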
# Converting categorical variables to the category dtype
cat_cols1 = [
    "Gender",
    "Education_Level",
    "Marital_Status",
    "Income_Category",
    "Card_Category",
]
bk[cat_cols1] = bk[cat_cols1].astype("category")
# Verifying category changes were made.
bk.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10127 entries, 0 to 10126
Data columns (total 20 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   Attrition_Flag            10127 non-null  object
 1   Customer_Age              10127 non-null  int64
 2   Gender                    10127 non-null  category
 3   Dependent_count           10127 non-null  int64
 4   Education_Level           8608 non-null   category
 5   Marital_Status            9378 non-null   category
 6   Income_Category           10127 non-null  category
 7   Card_Category             10127 non-null  category
 8   Months_on_book            10127 non-null  int64
 9   Total_Relationship_Count  10127 non-null  int64
 10  Months_Inactive_12_mon    10127 non-null  int64
 11  Contacts_Count_12_mon     10127 non-null  int64
 12  Credit_Limit              10127 non-null  float64
 13  Total_Revolving_Bal       10127 non-null  int64
 14  Avg_Open_To_Buy           10127 non-null  float64
 15  Total_Amt_Chng_Q4_Q1      10127 non-null  float64
 16  Total_Trans_Amt           10127 non-null  int64
 17  Total_Trans_Ct            10127 non-null  int64
 18  Total_Ct_Chng_Q4_Q1       10127 non-null  float64
 19  Avg_Utilization_Ratio     10127 non-null  float64
dtypes: category(5), float64(5), int64(9), object(1)
memory usage: 1.2+ MB
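Besides enabling category-aware operations, the conversion to category dtype also reduces memory, since each value is stored as a small integer code into a shared list of levels. A quick sketch demonstrating the effect on synthetic data:

```python
import pandas as pd

# A low-cardinality column stored as plain objects vs. as a category
s_obj = pd.Series(["Blue"] * 1000 + ["Silver"] * 10, dtype="object")
s_cat = s_obj.astype("category")

# deep=True counts the actual string storage, not just pointers
print(s_cat.memory_usage(deep=True) < s_obj.memory_usage(deep=True))  # True
```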
# Observing the range of values after dropping CLIENTNUM.
bk.describe()
| Customer_Age | Dependent_count | Months_on_book | Total_Relationship_Count | Months_Inactive_12_mon | Contacts_Count_12_mon | Credit_Limit | Total_Revolving_Bal | Avg_Open_To_Buy | Total_Amt_Chng_Q4_Q1 | Total_Trans_Amt | Total_Trans_Ct | Total_Ct_Chng_Q4_Q1 | Avg_Utilization_Ratio | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 10127.000 | 10127.000 | 10127.000 | 10127.000 | 10127.000 | 10127.000 | 10127.000 | 10127.000 | 10127.000 | 10127.000 | 10127.000 | 10127.000 | 10127.000 | 10127.000 |
| mean | 46.326 | 2.346 | 35.928 | 3.813 | 2.341 | 2.455 | 8631.954 | 1162.814 | 7469.140 | 0.760 | 4404.086 | 64.859 | 0.712 | 0.275 |
| std | 8.017 | 1.299 | 7.986 | 1.554 | 1.011 | 1.106 | 9088.777 | 814.987 | 9090.685 | 0.219 | 3397.129 | 23.473 | 0.238 | 0.276 |
| min | 26.000 | 0.000 | 13.000 | 1.000 | 0.000 | 0.000 | 1438.300 | 0.000 | 3.000 | 0.000 | 510.000 | 10.000 | 0.000 | 0.000 |
| 25% | 41.000 | 1.000 | 31.000 | 3.000 | 2.000 | 2.000 | 2555.000 | 359.000 | 1324.500 | 0.631 | 2155.500 | 45.000 | 0.582 | 0.023 |
| 50% | 46.000 | 2.000 | 36.000 | 4.000 | 2.000 | 2.000 | 4549.000 | 1276.000 | 3474.000 | 0.736 | 3899.000 | 67.000 | 0.702 | 0.176 |
| 75% | 52.000 | 3.000 | 40.000 | 5.000 | 3.000 | 3.000 | 11067.500 | 1784.000 | 9859.000 | 0.859 | 4741.000 | 81.000 | 0.818 | 0.503 |
| max | 73.000 | 5.000 | 56.000 | 6.000 | 6.000 | 6.000 | 34516.000 | 2517.000 | 34516.000 | 3.397 | 18484.000 | 139.000 | 3.714 | 0.999 |
# Checking for duplicate rows in the data
bk.duplicated().sum()
0
# Looking at unique values in the dataset.
bk.nunique()
Attrition_Flag                 2
Customer_Age                  45
Gender                         2
Dependent_count                6
Education_Level                6
Marital_Status                 3
Income_Category                6
Card_Category                  4
Months_on_book                44
Total_Relationship_Count       6
Months_Inactive_12_mon         7
Contacts_Count_12_mon          7
Credit_Limit                6205
Total_Revolving_Bal         1974
Avg_Open_To_Buy             6813
Total_Amt_Chng_Q4_Q1        1158
Total_Trans_Amt             5033
Total_Trans_Ct               126
Total_Ct_Chng_Q4_Q1          830
Avg_Utilization_Ratio        964
dtype: int64
# Function to plot a boxplot and a histogram on the same scale.
def histogram_boxplot(data, feature, figsize=(15, 10), kde=False, bins=None):
    """
    Boxplot and histogram combined

    data: dataframe
    feature: dataframe column
    figsize: size of figure (default (15, 10))
    kde: whether to show the density curve (default False)
    bins: number of bins for the histogram (default None)
    """
    f2, (ax_box2, ax_hist2) = plt.subplots(
        nrows=2,  # number of rows of the subplot grid
        sharex=True,  # x-axis will be shared among all subplots
        gridspec_kw={"height_ratios": (0.50, 1.00)},
        figsize=figsize,
    )  # creating the 2 subplots
    # Boxplot; a star marker indicates the mean value of the column
    sns.boxplot(data=data, x=feature, ax=ax_box2, showmeans=True, color="violet")
    # Histogram; pass bins through only when provided
    if bins:
        sns.histplot(data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins)
    else:
        sns.histplot(data=data, x=feature, kde=kde, ax=ax_hist2)
    # Mean (dashed green) and median (solid black) lines on the histogram
    ax_hist2.axvline(data[feature].mean(), color="green", linestyle="--")
    ax_hist2.axvline(data[feature].median(), color="black", linestyle="-")
# Observations on Customer_age
histogram_boxplot(bk, "Customer_Age")
# Looking deeper at outliers
max_threshold_age = bk["Customer_Age"].quantile(0.9998)
max_threshold_age
68.0
# Looking at two highest values for customer age
bk[bk["Customer_Age"] > max_threshold_age]
| Attrition_Flag | Customer_Age | Gender | Dependent_count | Education_Level | Marital_Status | Income_Category | Card_Category | Months_on_book | Total_Relationship_Count | Months_Inactive_12_mon | Contacts_Count_12_mon | Credit_Limit | Total_Revolving_Bal | Avg_Open_To_Buy | Total_Amt_Chng_Q4_Q1 | Total_Trans_Amt | Total_Trans_Ct | Total_Ct_Chng_Q4_Q1 | Avg_Utilization_Ratio | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 251 | Existing Customer | 73 | M | 0 | High School | Married | $40K - $60K | Blue | 36 | 5 | 3 | 2 | 4469.000 | 1125 | 3344.000 | 1.363 | 1765 | 34 | 1.615 | 0.252 |
| 254 | Existing Customer | 70 | M | 0 | High School | Married | Less than $40K | Blue | 56 | 3 | 2 | 3 | 3252.000 | 1495 | 1757.000 | 0.581 | 1227 | 15 | 0.875 | 0.460 |
# Dropping two highest customer ages.
bk.drop(index=bk[bk.Customer_Age > 69].index, inplace=True)
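Dropping rows is one option; an alternative that keeps every row is to cap (winsorize) extreme values at a chosen quantile with clip. A minimal sketch on a toy column, where the 90th percentile is an illustrative choice rather than the notebook's:

```python
import pandas as pd

# Toy age column with one extreme value
ages = pd.Series([26, 40, 46, 52, 73])

# Cap values above the 90th percentile instead of dropping the rows
upper = ages.quantile(0.90)
capped = ages.clip(upper=upper)
print(len(capped) == len(ages), bool(capped.max() <= upper))  # True True
```

Capping preserves sample size at the cost of distorting the tail, so it is worth comparing both approaches on model performance.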
# Observations on Dependent_count
histogram_boxplot(bk, "Dependent_count")
# Observations on Total_Relationship_Count
histogram_boxplot(bk, "Total_Relationship_Count")
# Observations on Total_Revolving_Bal
histogram_boxplot(bk, "Total_Revolving_Bal")
# Observations on Credit_Limit
histogram_boxplot(bk, "Credit_Limit")
# Looking deeper at outliers
max_threshold_Limit = bk["Credit_Limit"].quantile(0.90)
max_threshold_Limit
23400.800000000003
# Looking at rows above the 90th percentile of Credit_Limit
credit_limit_cap = bk[bk["Credit_Limit"] > max_threshold_Limit]
credit_limit_cap
| Attrition_Flag | Customer_Age | Gender | Dependent_count | Education_Level | Marital_Status | Income_Category | Card_Category | Months_on_book | Total_Relationship_Count | Months_Inactive_12_mon | Contacts_Count_12_mon | Credit_Limit | Total_Revolving_Bal | Avg_Open_To_Buy | Total_Amt_Chng_Q4_Q1 | Total_Trans_Amt | Total_Trans_Ct | Total_Ct_Chng_Q4_Q1 | Avg_Utilization_Ratio | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 6 | Existing Customer | 51 | M | 4 | NaN | Married | $120K + | Gold | 46 | 6 | 1 | 3 | 34516.000 | 2264 | 32252.000 | 1.975 | 1330 | 31 | 0.722 | 0.066 |
| 7 | Existing Customer | 32 | M | 0 | High School | NaN | $60K - $80K | Silver | 27 | 2 | 2 | 2 | 29081.000 | 1396 | 27685.000 | 2.204 | 1538 | 36 | 0.714 | 0.048 |
| 16 | Existing Customer | 48 | M | 4 | Post-Graduate | Single | $80K - $120K | Blue | 36 | 6 | 2 | 3 | 30367.000 | 2362 | 28005.000 | 1.708 | 1671 | 27 | 0.929 | 0.078 |
| 40 | Existing Customer | 45 | M | 3 | Graduate | Single | $80K - $120K | Blue | 41 | 2 | 2 | 2 | 32426.000 | 578 | 31848.000 | 1.042 | 1109 | 28 | 0.474 | 0.018 |
| 45 | Existing Customer | 49 | M | 4 | Uneducated | Single | $80K - $120K | Blue | 30 | 3 | 2 | 3 | 34516.000 | 0 | 34516.000 | 1.621 | 1444 | 28 | 1.333 | 0.000 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 10098 | Attrited Customer | 55 | M | 3 | Graduate | Single | $120K + | Silver | 36 | 4 | 3 | 4 | 34516.000 | 0 | 34516.000 | 1.007 | 9931 | 70 | 0.750 | 0.000 |
| 10100 | Existing Customer | 39 | M | 2 | Graduate | NaN | $60K - $80K | Silver | 36 | 4 | 2 | 2 | 29808.000 | 0 | 29808.000 | 0.669 | 16098 | 128 | 0.684 | 0.000 |
| 10104 | Existing Customer | 51 | M | 3 | Graduate | Single | $60K - $80K | Silver | 36 | 3 | 2 | 2 | 29663.000 | 1743 | 27920.000 | 0.667 | 14638 | 93 | 0.722 | 0.059 |
| 10110 | Existing Customer | 56 | M | 1 | Graduate | Single | $80K - $120K | Silver | 49 | 5 | 2 | 2 | 34516.000 | 1091 | 33425.000 | 0.640 | 15274 | 108 | 0.714 | 0.032 |
| 10112 | Attrited Customer | 33 | M | 2 | College | Married | $120K + | Gold | 20 | 2 | 1 | 4 | 34516.000 | 0 | 34516.000 | 1.004 | 9338 | 73 | 0.622 | 0.000 |
1013 rows × 20 columns
# Observations on Avg_Open_To_Buy
histogram_boxplot(bk, "Avg_Open_To_Buy")
# Observations on Total_Trans_Amt
histogram_boxplot(bk, "Total_Trans_Amt")
# Checking the 10 largest values of Total_Trans_Amt
bk.Total_Trans_Amt.nlargest(10)
9964     18484
10073    17995
10097    17744
9601     17634
9341     17628
10117    17498
10028    17437
9643     17390
9712     17350
9645     17258
Name: Total_Trans_Amt, dtype: int64
# Looking at rows with outliers
bk[bk["Total_Trans_Amt"] > 8750]
| Attrition_Flag | Customer_Age | Gender | Dependent_count | Education_Level | Marital_Status | Income_Category | Card_Category | Months_on_book | Total_Relationship_Count | Months_Inactive_12_mon | Contacts_Count_12_mon | Credit_Limit | Total_Revolving_Bal | Avg_Open_To_Buy | Total_Amt_Chng_Q4_Q1 | Total_Trans_Amt | Total_Trans_Ct | Total_Ct_Chng_Q4_Q1 | Avg_Utilization_Ratio | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 8650 | Existing Customer | 43 | F | 2 | College | Married | Less than $40K | Blue | 37 | 3 | 3 | 1 | 5380.000 | 1337 | 4043.000 | 0.845 | 8947 | 93 | 0.576 | 0.249 |
| 8670 | Existing Customer | 39 | F | 4 | Doctorate | Married | Less than $40K | Blue | 27 | 2 | 1 | 1 | 8587.000 | 1324 | 7263.000 | 0.897 | 8854 | 92 | 0.643 | 0.154 |
| 8708 | Existing Customer | 51 | F | 2 | NaN | Married | $40K - $60K | Blue | 36 | 2 | 2 | 1 | 14800.000 | 1020 | 13780.000 | 0.896 | 8796 | 94 | 0.593 | 0.069 |
| 8734 | Existing Customer | 45 | M | 5 | NaN | Married | $60K - $80K | Blue | 38 | 1 | 2 | 3 | 6895.000 | 2314 | 4581.000 | 0.840 | 8778 | 85 | 0.604 | 0.336 |
| 8847 | Existing Customer | 47 | M | 3 | Graduate | NaN | Less than $40K | Blue | 43 | 1 | 3 | 3 | 2781.000 | 1950 | 831.000 | 0.980 | 9331 | 93 | 0.500 | 0.701 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 10121 | Existing Customer | 56 | F | 1 | Graduate | Single | Less than $40K | Blue | 50 | 4 | 1 | 4 | 3688.000 | 606 | 3082.000 | 0.570 | 14596 | 120 | 0.791 | 0.164 |
| 10122 | Existing Customer | 50 | M | 2 | Graduate | Single | $40K - $60K | Blue | 40 | 3 | 2 | 3 | 4003.000 | 1851 | 2152.000 | 0.703 | 15476 | 117 | 0.857 | 0.462 |
| 10123 | Attrited Customer | 41 | M | 2 | NaN | Divorced | $40K - $60K | Blue | 25 | 4 | 2 | 3 | 4277.000 | 2186 | 2091.000 | 0.804 | 8764 | 69 | 0.683 | 0.511 |
| 10124 | Attrited Customer | 44 | F | 1 | High School | Married | Less than $40K | Blue | 36 | 5 | 3 | 4 | 5409.000 | 0 | 5409.000 | 0.819 | 10291 | 60 | 0.818 | 0.000 |
| 10126 | Attrited Customer | 43 | F | 2 | Graduate | Married | Less than $40K | Silver | 25 | 6 | 2 | 4 | 10388.000 | 1961 | 8427.000 | 0.703 | 10294 | 61 | 0.649 | 0.189 |
869 rows × 20 columns
# Observations on Total_Trans_Ct
histogram_boxplot(bk, "Total_Trans_Ct")
# Looking at rows with outliers
bk[bk["Total_Trans_Ct"] > 135]
| Attrition_Flag | Customer_Age | Gender | Dependent_count | Education_Level | Marital_Status | Income_Category | Card_Category | Months_on_book | Total_Relationship_Count | Months_Inactive_12_mon | Contacts_Count_12_mon | Credit_Limit | Total_Revolving_Bal | Avg_Open_To_Buy | Total_Amt_Chng_Q4_Q1 | Total_Trans_Amt | Total_Trans_Ct | Total_Ct_Chng_Q4_Q1 | Avg_Utilization_Ratio | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 9324 | Existing Customer | 41 | M | 3 | NaN | Married | $120K + | Blue | 33 | 2 | 4 | 3 | 34516.000 | 638 | 33878.000 | 0.724 | 13085 | 139 | 0.675 | 0.018 |
| 9586 | Existing Customer | 56 | F | 1 | High School | Married | abc | Blue | 49 | 1 | 2 | 1 | 17542.000 | 2517 | 15025.000 | 0.800 | 13939 | 138 | 0.792 | 0.143 |
# Dropping two highest transaction counts
bk.drop(index=bk[bk.Total_Trans_Ct > 135].index, inplace=True)
# Observations on Avg Utilization Ratio
histogram_boxplot(bk, "Avg_Utilization_Ratio")
# Observation on Months_on_book
histogram_boxplot(bk, "Months_on_book")
# Observation on Total Transaction Amount Change Q4 Q1
histogram_boxplot(bk, "Total_Amt_Chng_Q4_Q1")
# Looking deeper at outliers
max_threshold_Amt_Chng = bk["Total_Amt_Chng_Q4_Q1"].quantile(0.99)
max_threshold_Amt_Chng
1.5147800000000007
# Looking at rows above the 99th percentile of Total_Amt_Chng_Q4_Q1
Total_Amt_Chng_Cap = bk[bk["Total_Amt_Chng_Q4_Q1"] > max_threshold_Amt_Chng]
Total_Amt_Chng_Cap
| Attrition_Flag | Customer_Age | Gender | Dependent_count | Education_Level | Marital_Status | Income_Category | Card_Category | Months_on_book | Total_Relationship_Count | Months_Inactive_12_mon | Contacts_Count_12_mon | Credit_Limit | Total_Revolving_Bal | Avg_Open_To_Buy | Total_Amt_Chng_Q4_Q1 | Total_Trans_Amt | Total_Trans_Ct | Total_Ct_Chng_Q4_Q1 | Avg_Utilization_Ratio | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Existing Customer | 49 | F | 5 | Graduate | Single | Less than $40K | Blue | 44 | 6 | 1 | 2 | 8256.000 | 864 | 7392.000 | 1.541 | 1291 | 33 | 3.714 | 0.105 |
| 2 | Existing Customer | 51 | M | 3 | Graduate | Married | $80K - $120K | Blue | 36 | 4 | 1 | 0 | 3418.000 | 0 | 3418.000 | 2.594 | 1887 | 20 | 2.333 | 0.000 |
| 4 | Existing Customer | 40 | M | 3 | Uneducated | Married | $60K - $80K | Blue | 21 | 5 | 1 | 0 | 4716.000 | 0 | 4716.000 | 2.175 | 816 | 28 | 2.500 | 0.000 |
| 6 | Existing Customer | 51 | M | 4 | NaN | Married | $120K + | Gold | 46 | 6 | 1 | 3 | 34516.000 | 2264 | 32252.000 | 1.975 | 1330 | 31 | 0.722 | 0.066 |
| 7 | Existing Customer | 32 | M | 0 | High School | NaN | $60K - $80K | Silver | 27 | 2 | 2 | 2 | 29081.000 | 1396 | 27685.000 | 2.204 | 1538 | 36 | 0.714 | 0.048 |
| 8 | Existing Customer | 37 | M | 3 | Uneducated | Single | $60K - $80K | Blue | 36 | 5 | 2 | 0 | 22352.000 | 2517 | 19835.000 | 3.355 | 1350 | 24 | 1.182 | 0.113 |
| 9 | Existing Customer | 48 | M | 2 | Graduate | Single | $80K - $120K | Blue | 36 | 6 | 3 | 3 | 11656.000 | 1677 | 9979.000 | 1.524 | 1441 | 32 | 0.882 | 0.144 |
| 12 | Existing Customer | 56 | M | 1 | College | Single | $80K - $120K | Blue | 36 | 3 | 6 | 0 | 11751.000 | 0 | 11751.000 | 3.397 | 1539 | 17 | 3.250 | 0.000 |
| 15 | Existing Customer | 44 | M | 4 | NaN | NaN | $80K - $120K | Blue | 37 | 5 | 1 | 2 | 4234.000 | 972 | 3262.000 | 1.707 | 1348 | 27 | 1.700 | 0.230 |
| 16 | Existing Customer | 48 | M | 4 | Post-Graduate | Single | $80K - $120K | Blue | 36 | 6 | 2 | 3 | 30367.000 | 2362 | 28005.000 | 1.708 | 1671 | 27 | 0.929 | 0.078 |
| 18 | Existing Customer | 61 | M | 1 | High School | Married | $40K - $60K | Blue | 56 | 2 | 2 | 3 | 3193.000 | 2517 | 676.000 | 1.831 | 1336 | 30 | 1.143 | 0.788 |
| 22 | Existing Customer | 41 | M | 3 | High School | Married | $40K - $60K | Blue | 33 | 4 | 2 | 1 | 4470.000 | 680 | 3790.000 | 1.608 | 931 | 18 | 1.571 | 0.152 |
| 32 | Existing Customer | 41 | M | 4 | Graduate | Married | $60K - $80K | Blue | 36 | 4 | 1 | 2 | 8923.000 | 2517 | 6406.000 | 1.726 | 1589 | 24 | 1.667 | 0.282 |
| 33 | Existing Customer | 53 | F | 2 | College | Married | Less than $40K | Blue | 38 | 5 | 2 | 3 | 2650.000 | 1490 | 1160.000 | 1.750 | 1411 | 28 | 1.000 | 0.562 |
| 36 | Existing Customer | 55 | F | 3 | Graduate | Married | Less than $40K | Blue | 36 | 6 | 2 | 3 | 3035.000 | 2298 | 737.000 | 1.724 | 1877 | 37 | 1.176 | 0.757 |
| 45 | Existing Customer | 49 | M | 4 | Uneducated | Single | $80K - $120K | Blue | 30 | 3 | 2 | 3 | 34516.000 | 0 | 34516.000 | 1.621 | 1444 | 28 | 1.333 | 0.000 |
| 46 | Existing Customer | 56 | M | 2 | Doctorate | Married | $60K - $80K | Blue | 45 | 6 | 2 | 0 | 2283.000 | 1430 | 853.000 | 2.316 | 1741 | 27 | 0.588 | 0.626 |
| 47 | Existing Customer | 59 | M | 1 | Doctorate | Married | $40K - $60K | Blue | 52 | 3 | 2 | 2 | 2548.000 | 2020 | 528.000 | 2.357 | 1719 | 27 | 1.700 | 0.793 |
| 58 | Existing Customer | 44 | F | 5 | Graduate | Married | abc | Blue | 35 | 4 | 1 | 2 | 6273.000 | 978 | 5295.000 | 2.275 | 1359 | 25 | 1.083 | 0.156 |
| 68 | Existing Customer | 49 | M | 2 | Graduate | Married | $60K - $80K | Blue | 32 | 2 | 2 | 2 | 1687.000 | 1107 | 580.000 | 1.715 | 1670 | 17 | 2.400 | 0.656 |
| 88 | Existing Customer | 44 | M | 3 | High School | Single | $60K - $80K | Blue | 31 | 4 | 3 | 1 | 12756.000 | 837 | 11919.000 | 1.932 | 1413 | 14 | 1.800 | 0.066 |
| 89 | Existing Customer | 57 | M | 2 | NaN | Married | $120K + | Blue | 45 | 5 | 3 | 3 | 5266.000 | 0 | 5266.000 | 1.702 | 1516 | 29 | 1.636 | 0.000 |
| 94 | Existing Customer | 45 | F | 3 | NaN | Married | abc | Blue | 28 | 5 | 1 | 2 | 2535.000 | 2440 | 95.000 | 1.705 | 1312 | 20 | 1.222 | 0.963 |
| 95 | Existing Customer | 64 | M | 1 | Graduate | Married | Less than $40K | Blue | 52 | 6 | 4 | 3 | 1709.000 | 895 | 814.000 | 1.656 | 1673 | 32 | 0.882 | 0.524 |
| 113 | Existing Customer | 54 | F | 0 | Uneducated | Married | Less than $40K | Blue | 36 | 2 | 2 | 2 | 1494.000 | 706 | 788.000 | 1.674 | 1305 | 24 | 3.000 | 0.473 |
| 115 | Existing Customer | 49 | M | 1 | Graduate | Single | $80K - $120K | Blue | 36 | 6 | 2 | 2 | 18886.000 | 895 | 17991.000 | 1.826 | 1235 | 18 | 1.571 | 0.047 |
| 117 | Existing Customer | 50 | M | 3 | High School | Single | $80K - $120K | Blue | 39 | 4 | 1 | 4 | 9964.000 | 1559 | 8405.000 | 1.873 | 1626 | 25 | 0.786 | 0.156 |
| 122 | Existing Customer | 53 | M | 1 | High School | Married | $80K - $120K | Blue | 47 | 3 | 2 | 2 | 12262.000 | 1836 | 10426.000 | 1.584 | 1659 | 38 | 1.533 | 0.150 |
| 133 | Existing Customer | 52 | M | 1 | Graduate | Married | $80K - $120K | Blue | 43 | 3 | 1 | 4 | 3710.000 | 2517 | 1193.000 | 1.541 | 1578 | 34 | 0.700 | 0.678 |
| 137 | Existing Customer | 45 | M | 4 | College | Divorced | $60K - $80K | Blue | 40 | 5 | 1 | 0 | 10408.000 | 1186 | 9222.000 | 1.689 | 2560 | 42 | 1.211 | 0.114 |
| 142 | Existing Customer | 54 | M | 4 | Graduate | Married | $80K - $120K | Blue | 34 | 2 | 3 | 2 | 14926.000 | 2517 | 12409.000 | 1.996 | 1576 | 25 | 1.500 | 0.169 |
| 154 | Existing Customer | 53 | F | 1 | College | Married | Less than $40K | Blue | 47 | 4 | 2 | 3 | 2154.000 | 930 | 1224.000 | 2.121 | 1439 | 26 | 1.364 | 0.432 |
| 177 | Existing Customer | 67 | F | 1 | Graduate | Married | Less than $40K | Blue | 56 | 4 | 3 | 2 | 3006.000 | 2517 | 489.000 | 2.053 | 1661 | 32 | 1.000 | 0.837 |
| 180 | Existing Customer | 45 | M | 2 | Uneducated | Married | $40K - $60K | Blue | 34 | 3 | 2 | 1 | 5771.000 | 2248 | 3523.000 | 1.791 | 1387 | 18 | 0.800 | 0.390 |
| 190 | Existing Customer | 57 | M | 1 | Graduate | Married | $80K - $120K | Blue | 47 | 5 | 3 | 1 | 14612.000 | 1976 | 12636.000 | 1.768 | 1827 | 24 | 3.000 | 0.135 |
| 197 | Existing Customer | 53 | M | 3 | Uneducated | Married | $80K - $120K | Blue | 43 | 3 | 3 | 1 | 16393.000 | 1423 | 14970.000 | 1.642 | 1461 | 42 | 0.750 | 0.087 |
| 219 | Existing Customer | 44 | F | 3 | Uneducated | Divorced | Less than $40K | Silver | 38 | 4 | 1 | 3 | 11127.000 | 1835 | 9292.000 | 2.368 | 1546 | 25 | 1.273 | 0.165 |
| 231 | Existing Customer | 57 | M | 2 | NaN | Married | $80K - $120K | Blue | 46 | 2 | 3 | 0 | 18871.000 | 1740 | 17131.000 | 1.727 | 1516 | 21 | 2.000 | 0.092 |
| 263 | Existing Customer | 46 | F | 1 | NaN | Married | Less than $40K | Blue | 36 | 2 | 3 | 2 | 2474.000 | 2143 | 331.000 | 1.530 | 1541 | 40 | 1.000 | 0.866 |
| 284 | Existing Customer | 61 | M | 0 | Graduate | Married | $40K - $60K | Blue | 52 | 3 | 1 | 2 | 2939.000 | 1999 | 940.000 | 2.145 | 2434 | 33 | 1.538 | 0.680 |
| 295 | Existing Customer | 60 | M | 0 | High School | Married | $40K - $60K | Blue | 36 | 5 | 1 | 3 | 3281.000 | 837 | 2444.000 | 1.859 | 1424 | 29 | 1.417 | 0.255 |
| 336 | Existing Customer | 56 | F | 1 | Graduate | Married | Less than $40K | Blue | 38 | 4 | 3 | 3 | 2578.000 | 2462 | 116.000 | 1.707 | 1378 | 29 | 0.812 | 0.955 |
| 392 | Existing Customer | 56 | M | 3 | High School | Married | $60K - $80K | Blue | 36 | 5 | 1 | 3 | 9226.000 | 1233 | 7993.000 | 1.521 | 1626 | 29 | 1.636 | 0.134 |
| 422 | Existing Customer | 63 | M | 0 | High School | Married | $60K - $80K | Blue | 36 | 6 | 2 | 3 | 2567.000 | 0 | 2567.000 | 1.636 | 2399 | 43 | 0.955 | 0.000 |
| 431 | Existing Customer | 47 | F | 4 | NaN | Divorced | $40K - $60K | Blue | 34 | 6 | 1 | 2 | 3502.000 | 1851 | 1651.000 | 2.023 | 1814 | 31 | 0.722 | 0.529 |
| 466 | Existing Customer | 63 | M | 2 | Graduate | Married | $60K - $80K | Blue | 49 | 5 | 2 | 3 | 14035.000 | 2061 | 11974.000 | 2.271 | 1606 | 30 | 1.500 | 0.147 |
| 495 | Existing Customer | 52 | M | 4 | High School | Married | $80K - $120K | Blue | 42 | 6 | 1 | 0 | 5738.000 | 1112 | 4626.000 | 1.519 | 1922 | 48 | 0.655 | 0.194 |
| 501 | Existing Customer | 55 | M | 1 | High School | Married | $80K - $120K | Blue | 45 | 3 | 3 | 3 | 16794.000 | 809 | 15985.000 | 1.566 | 1917 | 33 | 0.941 | 0.048 |
| 656 | Existing Customer | 55 | F | 4 | Graduate | Married | $40K - $60K | Blue | 41 | 4 | 3 | 3 | 3156.000 | 2517 | 639.000 | 1.516 | 2091 | 48 | 0.778 | 0.798 |
| 658 | Existing Customer | 46 | M | 4 | Graduate | Married | $60K - $80K | Blue | 35 | 5 | 1 | 2 | 1535.000 | 700 | 835.000 | 2.282 | 1848 | 25 | 1.083 | 0.456 |
| 714 | Existing Customer | 38 | M | 1 | High School | Married | $80K - $120K | Blue | 23 | 4 | 3 | 2 | 22149.000 | 987 | 21162.000 | 1.633 | 2409 | 36 | 0.800 | 0.045 |
| 761 | Existing Customer | 58 | M | 0 | Uneducated | Married | $80K - $120K | Blue | 38 | 6 | 3 | 3 | 3785.000 | 1935 | 1850.000 | 1.558 | 1824 | 49 | 1.450 | 0.511 |
| 773 | Existing Customer | 61 | M | 0 | Post-Graduate | Married | abc | Blue | 53 | 6 | 2 | 3 | 14434.000 | 1927 | 12507.000 | 2.675 | 1731 | 32 | 3.571 | 0.134 |
| 841 | Existing Customer | 37 | F | 3 | NaN | Married | Less than $40K | Blue | 25 | 6 | 2 | 1 | 1438.300 | 674 | 764.300 | 2.180 | 1717 | 31 | 0.722 | 0.469 |
| 855 | Existing Customer | 39 | F | 2 | Graduate | Married | abc | Blue | 31 | 4 | 2 | 3 | 1438.300 | 997 | 441.300 | 1.867 | 2583 | 47 | 0.958 | 0.693 |
| 869 | Existing Customer | 39 | M | 2 | College | Married | $60K - $80K | Blue | 35 | 4 | 3 | 2 | 7410.000 | 2517 | 4893.000 | 1.924 | 2398 | 37 | 1.176 | 0.340 |
| 870 | Existing Customer | 53 | M | 3 | Graduate | Married | $120K + | Blue | 44 | 3 | 2 | 3 | 12163.000 | 2082 | 10081.000 | 1.554 | 2649 | 56 | 0.697 | 0.171 |
| 906 | Existing Customer | 42 | M | 3 | High School | Single | $40K - $60K | Blue | 35 | 4 | 3 | 3 | 3266.000 | 1900 | 1366.000 | 1.568 | 2183 | 56 | 0.750 | 0.582 |
| 914 | Existing Customer | 55 | M | 2 | College | Married | $80K - $120K | Blue | 45 | 5 | 1 | 2 | 2929.000 | 2431 | 498.000 | 1.632 | 1903 | 39 | 0.625 | 0.830 |
| 919 | Existing Customer | 33 | M | 2 | Post-Graduate | Married | $80K - $120K | Blue | 22 | 3 | 3 | 3 | 1438.300 | 681 | 757.300 | 1.542 | 2730 | 48 | 0.778 | 0.473 |
| 975 | Existing Customer | 59 | M | 1 | Post-Graduate | Married | $40K - $60K | Blue | 36 | 4 | 2 | 2 | 2483.000 | 1372 | 1111.000 | 1.642 | 1704 | 35 | 0.458 | 0.553 |
| 1008 | Existing Customer | 39 | F | 1 | Graduate | Married | abc | Blue | 30 | 3 | 2 | 3 | 2917.000 | 1357 | 1560.000 | 1.593 | 2422 | 46 | 0.438 | 0.465 |
| 1011 | Existing Customer | 31 | M | 2 | Uneducated | Married | Less than $40K | Blue | 18 | 3 | 2 | 4 | 1438.300 | 0 | 1438.300 | 1.561 | 2789 | 50 | 1.083 | 0.000 |
| 1051 | Existing Customer | 54 | F | 1 | Uneducated | Married | Less than $40K | Blue | 40 | 5 | 2 | 4 | 2178.000 | 1144 | 1034.000 | 1.585 | 1724 | 47 | 1.043 | 0.525 |
| 1060 | Existing Customer | 37 | F | 2 | Graduate | Married | Less than $40K | Blue | 30 | 5 | 2 | 2 | 2077.000 | 1392 | 685.000 | 1.553 | 1629 | 36 | 1.250 | 0.670 |
| 1085 | Existing Customer | 45 | F | 3 | Graduate | Single | abc | Blue | 36 | 3 | 3 | 4 | 11189.000 | 2517 | 8672.000 | 2.041 | 2959 | 58 | 1.231 | 0.225 |
| 1118 | Existing Customer | 29 | M | 0 | High School | Married | $40K - $60K | Blue | 13 | 4 | 1 | 4 | 1458.000 | 671 | 787.000 | 1.583 | 2449 | 56 | 0.931 | 0.460 |
| 1166 | Existing Customer | 38 | F | 2 | Graduate | Married | Less than $40K | Blue | 36 | 6 | 3 | 4 | 2717.000 | 1591 | 1126.000 | 1.596 | 2064 | 33 | 1.062 | 0.586 |
| 1176 | Existing Customer | 34 | M | 2 | College | Married | $80K - $120K | Blue | 22 | 4 | 2 | 4 | 1631.000 | 0 | 1631.000 | 1.893 | 2962 | 57 | 1.111 | 0.000 |
| 1195 | Existing Customer | 36 | M | 2 | Graduate | Married | $80K - $120K | Blue | 28 | 5 | 3 | 3 | 5634.000 | 1996 | 3638.000 | 1.589 | 2366 | 49 | 0.690 | 0.354 |
| 1219 | Existing Customer | 38 | F | 4 | Graduate | Married | abc | Blue | 28 | 4 | 1 | 2 | 6861.000 | 1598 | 5263.000 | 2.103 | 2228 | 39 | 0.950 | 0.233 |
| 1232 | Existing Customer | 58 | M | 1 | Uneducated | Married | $80K - $120K | Blue | 50 | 5 | 3 | 2 | 6978.000 | 1148 | 5830.000 | 1.645 | 1362 | 26 | 1.000 | 0.165 |
| 1369 | Existing Customer | 36 | F | 2 | Uneducated | Married | Less than $40K | Blue | 36 | 4 | 2 | 2 | 4066.000 | 1639 | 2427.000 | 1.749 | 3040 | 56 | 0.931 | 0.403 |
| 1407 | Existing Customer | 33 | F | 3 | Graduate | Married | Less than $40K | Blue | 25 | 3 | 2 | 2 | 2307.000 | 1385 | 922.000 | 1.515 | 2440 | 48 | 0.846 | 0.600 |
| 1417 | Existing Customer | 39 | F | 1 | Graduate | Married | $40K - $60K | Blue | 27 | 3 | 1 | 2 | 2700.000 | 2220 | 480.000 | 1.595 | 2901 | 59 | 1.360 | 0.822 |
| 1455 | Existing Customer | 39 | F | 2 | Doctorate | Married | abc | Blue | 36 | 5 | 2 | 4 | 8058.000 | 791 | 7267.000 | 1.787 | 2742 | 42 | 2.000 | 0.098 |
| 1473 | Existing Customer | 35 | M | 2 | Graduate | Married | $60K - $80K | Blue | 36 | 3 | 3 | 2 | 1438.300 | 0 | 1438.300 | 1.623 | 2492 | 55 | 0.897 | 0.000 |
| 1486 | Existing Customer | 39 | M | 2 | Graduate | Married | $40K - $60K | Blue | 31 | 5 | 3 | 2 | 8687.000 | 1146 | 7541.000 | 1.800 | 2279 | 33 | 1.357 | 0.132 |
| 1491 | Existing Customer | 36 | F | 3 | High School | Married | Less than $40K | Blue | 27 | 4 | 2 | 4 | 3444.000 | 2441 | 1003.000 | 1.563 | 2668 | 51 | 0.545 | 0.709 |
| 1521 | Existing Customer | 36 | M | 2 | Doctorate | Married | $60K - $80K | Blue | 30 | 3 | 3 | 4 | 4934.000 | 1504 | 3430.000 | 1.520 | 2747 | 42 | 0.448 | 0.305 |
| 1570 | Existing Customer | 49 | M | 2 | NaN | Single | $60K - $80K | Blue | 38 | 4 | 1 | 2 | 2461.000 | 1586 | 875.000 | 1.676 | 1729 | 35 | 0.750 | 0.644 |
| 1599 | Existing Customer | 35 | F | 1 | High School | Married | abc | Blue | 17 | 5 | 3 | 3 | 4588.000 | 2517 | 2071.000 | 1.525 | 2442 | 62 | 0.632 | 0.549 |
| 1627 | Existing Customer | 38 | F | 2 | Graduate | Married | Less than $40K | Blue | 33 | 5 | 2 | 3 | 2228.000 | 1717 | 511.000 | 1.625 | 2334 | 61 | 0.848 | 0.771 |
| 1647 | Existing Customer | 37 | M | 3 | NaN | Married | $60K - $80K | Blue | 24 | 3 | 1 | 2 | 11894.000 | 816 | 11078.000 | 1.561 | 2438 | 45 | 0.607 | 0.069 |
| 1689 | Existing Customer | 34 | M | 0 | Graduate | Married | $60K - $80K | Blue | 26 | 4 | 3 | 3 | 5175.000 | 977 | 4198.000 | 1.705 | 2405 | 49 | 0.885 | 0.189 |
| 1718 | Existing Customer | 42 | F | 4 | Post-Graduate | Single | Less than $40K | Blue | 36 | 6 | 2 | 3 | 1438.300 | 674 | 764.300 | 1.769 | 2451 | 55 | 1.292 | 0.469 |
| 1873 | Existing Customer | 38 | M | 3 | Uneducated | Married | $60K - $80K | Blue | 36 | 5 | 2 | 3 | 3421.000 | 2308 | 1113.000 | 2.037 | 2269 | 39 | 1.053 | 0.675 |
| 1879 | Existing Customer | 35 | F | 1 | Graduate | Married | $40K - $60K | Blue | 29 | 4 | 2 | 2 | 2523.000 | 1877 | 646.000 | 1.631 | 2005 | 50 | 1.174 | 0.744 |
| 1882 | Existing Customer | 36 | M | 3 | College | Married | $120K + | Blue | 26 | 3 | 2 | 3 | 17116.000 | 939 | 16177.000 | 1.575 | 2547 | 45 | 0.731 | 0.055 |
| 1883 | Existing Customer | 37 | M | 2 | College | Married | $80K - $120K | Blue | 17 | 5 | 3 | 2 | 4631.000 | 1991 | 2640.000 | 1.669 | 2864 | 37 | 0.947 | 0.430 |
| 1934 | Existing Customer | 36 | F | 2 | NaN | Married | Less than $40K | Blue | 27 | 3 | 3 | 4 | 3049.000 | 2198 | 851.000 | 1.546 | 2617 | 48 | 1.000 | 0.721 |
| 1993 | Existing Customer | 31 | M | 2 | College | Married | Less than $40K | Blue | 21 | 6 | 2 | 4 | 2718.000 | 1920 | 798.000 | 1.604 | 2570 | 49 | 1.042 | 0.706 |
| 2263 | Existing Customer | 45 | F | 4 | Uneducated | Married | $40K - $60K | Blue | 36 | 3 | 1 | 3 | 3111.000 | 2290 | 821.000 | 1.612 | 1998 | 41 | 0.464 | 0.736 |
| 2321 | Existing Customer | 36 | M | 2 | Uneducated | Married | $60K - $80K | Blue | 21 | 4 | 1 | 5 | 12763.000 | 2517 | 10246.000 | 1.631 | 2599 | 46 | 0.917 | 0.197 |
| 2324 | Existing Customer | 37 | F | 4 | Graduate | Married | Less than $40K | Blue | 28 | 4 | 3 | 1 | 1825.000 | 1389 | 436.000 | 1.595 | 2878 | 54 | 0.636 | 0.761 |
| 2337 | Existing Customer | 50 | F | 2 | Graduate | Divorced | $40K - $60K | Blue | 40 | 6 | 2 | 5 | 8307.000 | 2517 | 5790.000 | 1.743 | 2293 | 36 | 0.800 | 0.303 |
| 2530 | Existing Customer | 35 | M | 3 | Graduate | Married | $120K + | Blue | 28 | 4 | 3 | 2 | 14869.000 | 0 | 14869.000 | 1.624 | 2753 | 48 | 1.087 | 0.000 |
| 2565 | Existing Customer | 39 | M | 3 | Graduate | Married | $120K + | Blue | 36 | 3 | 3 | 2 | 32964.000 | 2231 | 30733.000 | 1.731 | 3094 | 45 | 1.647 | 0.068 |
| 2752 | Existing Customer | 37 | M | 3 | Graduate | Married | $60K - $80K | Blue | 31 | 3 | 1 | 3 | 4940.000 | 1199 | 3741.000 | 1.570 | 3246 | 57 | 0.900 | 0.243 |
| 2761 | Existing Customer | 39 | M | 4 | High School | Married | $80K - $120K | Blue | 25 | 6 | 2 | 3 | 8208.000 | 0 | 8208.000 | 1.568 | 3135 | 39 | 1.167 | 0.000 |
| 3061 | Existing Customer | 44 | M | 3 | NaN | Married | $80K - $120K | Blue | 38 | 3 | 1 | 4 | 4198.000 | 1782 | 2416.000 | 1.526 | 2051 | 36 | 1.000 | 0.424 |
| 3270 | Existing Customer | 49 | M | 3 | High School | NaN | $60K - $80K | Blue | 36 | 3 | 2 | 2 | 9551.000 | 1833 | 7718.000 | 1.675 | 3213 | 52 | 1.476 | 0.192 |
# Observation on Total Transaction Count Change Q4 Q1
histogram_boxplot(bk, "Total_Ct_Chng_Q4_Q1")
# Looking deeper at outliers
max_threshold_Ct_Chng = bk["Total_Ct_Chng_Q4_Q1"].quantile(0.992)
max_threshold_Ct_Chng
1.5
# Looking at the rows above the threshold
Total_Ct_Chng_Cap = bk[bk["Total_Ct_Chng_Q4_Q1"] > max_threshold_Ct_Chng]
Total_Ct_Chng_Cap
| | Attrition_Flag | Customer_Age | Gender | Dependent_count | Education_Level | Marital_Status | Income_Category | Card_Category | Months_on_book | Total_Relationship_Count | Months_Inactive_12_mon | Contacts_Count_12_mon | Credit_Limit | Total_Revolving_Bal | Avg_Open_To_Buy | Total_Amt_Chng_Q4_Q1 | Total_Trans_Amt | Total_Trans_Ct | Total_Ct_Chng_Q4_Q1 | Avg_Utilization_Ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Existing Customer | 45 | M | 3 | High School | Married | $60K - $80K | Blue | 39 | 5 | 1 | 3 | 12691.000 | 777 | 11914.000 | 1.335 | 1144 | 42 | 1.625 | 0.061 |
| 1 | Existing Customer | 49 | F | 5 | Graduate | Single | Less than $40K | Blue | 44 | 6 | 1 | 2 | 8256.000 | 864 | 7392.000 | 1.541 | 1291 | 33 | 3.714 | 0.105 |
| 2 | Existing Customer | 51 | M | 3 | Graduate | Married | $80K - $120K | Blue | 36 | 4 | 1 | 0 | 3418.000 | 0 | 3418.000 | 2.594 | 1887 | 20 | 2.333 | 0.000 |
| 3 | Existing Customer | 40 | F | 4 | High School | NaN | Less than $40K | Blue | 34 | 3 | 4 | 1 | 3313.000 | 2517 | 796.000 | 1.405 | 1171 | 20 | 2.333 | 0.760 |
| 4 | Existing Customer | 40 | M | 3 | Uneducated | Married | $60K - $80K | Blue | 21 | 5 | 1 | 0 | 4716.000 | 0 | 4716.000 | 2.175 | 816 | 28 | 2.500 | 0.000 |
| 12 | Existing Customer | 56 | M | 1 | College | Single | $80K - $120K | Blue | 36 | 3 | 6 | 0 | 11751.000 | 0 | 11751.000 | 3.397 | 1539 | 17 | 3.250 | 0.000 |
| 13 | Existing Customer | 35 | M | 3 | Graduate | NaN | $60K - $80K | Blue | 30 | 5 | 1 | 3 | 8547.000 | 1666 | 6881.000 | 1.163 | 1311 | 33 | 2.000 | 0.195 |
| 15 | Existing Customer | 44 | M | 4 | NaN | NaN | $80K - $120K | Blue | 37 | 5 | 1 | 2 | 4234.000 | 972 | 3262.000 | 1.707 | 1348 | 27 | 1.700 | 0.230 |
| 17 | Existing Customer | 41 | M | 3 | NaN | Married | $80K - $120K | Blue | 34 | 4 | 4 | 1 | 13535.000 | 1291 | 12244.000 | 0.653 | 1028 | 21 | 1.625 | 0.095 |
| 22 | Existing Customer | 41 | M | 3 | High School | Married | $40K - $60K | Blue | 33 | 4 | 2 | 1 | 4470.000 | 680 | 3790.000 | 1.608 | 931 | 18 | 1.571 | 0.152 |
| 28 | Existing Customer | 44 | F | 3 | Uneducated | Single | abc | Blue | 34 | 5 | 2 | 2 | 10100.000 | 0 | 10100.000 | 0.525 | 1052 | 18 | 1.571 | 0.000 |
| 30 | Existing Customer | 53 | M | 3 | NaN | Married | $80K - $120K | Blue | 33 | 3 | 2 | 3 | 2753.000 | 1811 | 942.000 | 0.977 | 1038 | 25 | 2.571 | 0.658 |
| 31 | Existing Customer | 53 | M | 2 | Uneducated | Married | $60K - $80K | Blue | 48 | 2 | 5 | 1 | 2451.000 | 1690 | 761.000 | 1.323 | 1596 | 26 | 1.600 | 0.690 |
| 32 | Existing Customer | 41 | M | 4 | Graduate | Married | $60K - $80K | Blue | 36 | 4 | 1 | 2 | 8923.000 | 2517 | 6406.000 | 1.726 | 1589 | 24 | 1.667 | 0.282 |
| 47 | Existing Customer | 59 | M | 1 | Doctorate | Married | $40K - $60K | Blue | 52 | 3 | 2 | 2 | 2548.000 | 2020 | 528.000 | 2.357 | 1719 | 27 | 1.700 | 0.793 |
| 52 | Existing Customer | 66 | F | 0 | High School | Married | Less than $40K | Blue | 54 | 3 | 4 | 2 | 3171.000 | 2179 | 992.000 | 1.224 | 1946 | 38 | 1.923 | 0.687 |
| 68 | Existing Customer | 49 | M | 2 | Graduate | Married | $60K - $80K | Blue | 32 | 2 | 2 | 2 | 1687.000 | 1107 | 580.000 | 1.715 | 1670 | 17 | 2.400 | 0.656 |
| 69 | Existing Customer | 50 | M | 2 | Doctorate | Married | $80K - $120K | Blue | 38 | 6 | 2 | 2 | 25300.000 | 1330 | 23970.000 | 1.072 | 837 | 15 | 2.000 | 0.053 |
| 76 | Existing Customer | 44 | F | 4 | Graduate | Single | Less than $40K | Blue | 36 | 6 | 4 | 2 | 7000.000 | 2517 | 4483.000 | 0.475 | 1112 | 23 | 1.875 | 0.360 |
| 84 | Existing Customer | 53 | M | 1 | Graduate | Divorced | $80K - $120K | Blue | 35 | 5 | 4 | 2 | 34516.000 | 1219 | 33297.000 | 1.129 | 1590 | 27 | 2.000 | 0.035 |
| 88 | Existing Customer | 44 | M | 3 | High School | Single | $60K - $80K | Blue | 31 | 4 | 3 | 1 | 12756.000 | 837 | 11919.000 | 1.932 | 1413 | 14 | 1.800 | 0.066 |
| 89 | Existing Customer | 57 | M | 2 | NaN | Married | $120K + | Blue | 45 | 5 | 3 | 3 | 5266.000 | 0 | 5266.000 | 1.702 | 1516 | 29 | 1.636 | 0.000 |
| 91 | Existing Customer | 49 | M | 4 | High School | Single | $80K - $120K | Blue | 38 | 4 | 3 | 0 | 31302.000 | 1953 | 29349.000 | 0.875 | 1564 | 35 | 2.182 | 0.062 |
| 111 | Existing Customer | 42 | M | 3 | NaN | Married | $60K - $80K | Blue | 36 | 5 | 3 | 3 | 15088.000 | 865 | 14223.000 | 0.939 | 1107 | 21 | 1.625 | 0.057 |
| 113 | Existing Customer | 54 | F | 0 | Uneducated | Married | Less than $40K | Blue | 36 | 2 | 2 | 2 | 1494.000 | 706 | 788.000 | 1.674 | 1305 | 24 | 3.000 | 0.473 |
| 115 | Existing Customer | 49 | M | 1 | Graduate | Single | $80K - $120K | Blue | 36 | 6 | 2 | 2 | 18886.000 | 895 | 17991.000 | 1.826 | 1235 | 18 | 1.571 | 0.047 |
| 122 | Existing Customer | 53 | M | 1 | High School | Married | $80K - $120K | Blue | 47 | 3 | 2 | 2 | 12262.000 | 1836 | 10426.000 | 1.584 | 1659 | 38 | 1.533 | 0.150 |
| 131 | Existing Customer | 43 | M | 4 | NaN | Married | $40K - $60K | Blue | 39 | 5 | 1 | 2 | 6111.000 | 2517 | 3594.000 | 0.632 | 1221 | 16 | 2.200 | 0.412 |
| 132 | Existing Customer | 57 | M | 4 | Graduate | Married | $80K - $120K | Blue | 46 | 3 | 2 | 3 | 19270.000 | 1662 | 17608.000 | 1.186 | 1565 | 28 | 1.545 | 0.086 |
| 146 | Existing Customer | 41 | F | 2 | Graduate | Single | Less than $40K | Blue | 32 | 6 | 3 | 2 | 2250.000 | 2117 | 133.000 | 1.162 | 1617 | 31 | 2.875 | 0.941 |
| 151 | Existing Customer | 68 | M | 1 | Graduate | Married | abc | Blue | 56 | 5 | 2 | 3 | 13860.000 | 1652 | 12208.000 | 1.255 | 1910 | 32 | 1.909 | 0.119 |
| 158 | Existing Customer | 44 | F | 2 | Uneducated | Married | abc | Silver | 35 | 4 | 3 | 2 | 32643.000 | 0 | 32643.000 | 1.300 | 1058 | 24 | 2.429 | 0.000 |
| 162 | Existing Customer | 46 | M | 1 | Uneducated | Married | $60K - $80K | Blue | 36 | 3 | 2 | 2 | 20459.000 | 2071 | 18388.000 | 1.248 | 1767 | 38 | 2.167 | 0.101 |
| 167 | Existing Customer | 58 | F | 4 | Uneducated | Married | Less than $40K | Blue | 47 | 3 | 3 | 3 | 2822.000 | 2173 | 649.000 | 1.456 | 1218 | 23 | 2.286 | 0.770 |
| 190 | Existing Customer | 57 | M | 1 | Graduate | Married | $80K - $120K | Blue | 47 | 5 | 3 | 1 | 14612.000 | 1976 | 12636.000 | 1.768 | 1827 | 24 | 3.000 | 0.135 |
| 231 | Existing Customer | 57 | M | 2 | NaN | Married | $80K - $120K | Blue | 46 | 2 | 3 | 0 | 18871.000 | 1740 | 17131.000 | 1.727 | 1516 | 21 | 2.000 | 0.092 |
| 239 | Existing Customer | 44 | M | 4 | NaN | Single | $80K - $120K | Blue | 34 | 4 | 1 | 2 | 21573.000 | 1585 | 19988.000 | 0.621 | 1384 | 36 | 2.273 | 0.073 |
| 255 | Existing Customer | 55 | F | 2 | High School | Married | Less than $40K | Blue | 36 | 4 | 3 | 3 | 2358.000 | 1152 | 1206.000 | 0.969 | 1737 | 26 | 1.600 | 0.489 |
| 269 | Existing Customer | 54 | M | 5 | Graduate | Married | $60K - $80K | Blue | 38 | 3 | 3 | 3 | 2290.000 | 1434 | 856.000 | 0.923 | 1119 | 18 | 3.500 | 0.626 |
| 280 | Existing Customer | 43 | M | 1 | Graduate | Single | $80K - $120K | Silver | 37 | 4 | 3 | 2 | 34516.000 | 1440 | 33076.000 | 1.117 | 1575 | 34 | 2.400 | 0.042 |
| 281 | Existing Customer | 41 | M | 4 | High School | NaN | $80K - $120K | Blue | 36 | 4 | 3 | 0 | 23018.000 | 2168 | 20850.000 | 0.859 | 1463 | 24 | 1.667 | 0.094 |
| 282 | Existing Customer | 54 | F | 1 | Uneducated | Married | $40K - $60K | Blue | 44 | 5 | 2 | 3 | 5556.000 | 1711 | 3845.000 | 1.434 | 1706 | 21 | 1.625 | 0.308 |
| 284 | Existing Customer | 61 | M | 0 | Graduate | Married | $40K - $60K | Blue | 52 | 3 | 1 | 2 | 2939.000 | 1999 | 940.000 | 2.145 | 2434 | 33 | 1.538 | 0.680 |
| 294 | Existing Customer | 45 | M | 2 | College | Single | $60K - $80K | Blue | 33 | 6 | 3 | 0 | 23218.000 | 1814 | 21404.000 | 1.178 | 1749 | 37 | 2.083 | 0.078 |
| 300 | Existing Customer | 50 | M | 0 | High School | Single | $120K + | Silver | 45 | 5 | 2 | 3 | 14938.000 | 2303 | 12635.000 | 0.804 | 949 | 27 | 2.000 | 0.154 |
| 309 | Existing Customer | 43 | F | 4 | Uneducated | Married | Less than $40K | Blue | 36 | 2 | 1 | 1 | 9684.000 | 2475 | 7209.000 | 0.697 | 1400 | 31 | 2.100 | 0.256 |
| 322 | Existing Customer | 51 | F | 2 | Uneducated | Divorced | Less than $40K | Blue | 38 | 4 | 1 | 2 | 9648.000 | 1926 | 7722.000 | 0.549 | 1083 | 22 | 1.750 | 0.200 |
| 323 | Existing Customer | 52 | M | 2 | High School | Single | $80K - $120K | Silver | 40 | 3 | 3 | 3 | 34516.000 | 2177 | 32339.000 | 1.320 | 1234 | 23 | 1.875 | 0.063 |
| 324 | Existing Customer | 55 | F | 2 | Uneducated | Married | abc | Blue | 35 | 3 | 3 | 0 | 3025.000 | 2376 | 649.000 | 1.252 | 1277 | 23 | 1.875 | 0.785 |
| 346 | Existing Customer | 58 | F | 3 | High School | Married | Less than $40K | Blue | 47 | 6 | 1 | 0 | 2609.000 | 2439 | 170.000 | 1.269 | 1731 | 34 | 1.833 | 0.935 |
| 366 | Existing Customer | 36 | F | 4 | Graduate | Married | $40K - $60K | Blue | 36 | 6 | 3 | 3 | 1628.000 | 969 | 659.000 | 0.999 | 1893 | 15 | 2.750 | 0.595 |
| 392 | Existing Customer | 56 | M | 3 | High School | Married | $60K - $80K | Blue | 36 | 5 | 1 | 3 | 9226.000 | 1233 | 7993.000 | 1.521 | 1626 | 29 | 1.636 | 0.134 |
| 418 | Existing Customer | 44 | M | 5 | Uneducated | Married | $60K - $80K | Blue | 36 | 3 | 3 | 2 | 8645.000 | 1454 | 7191.000 | 0.708 | 1435 | 25 | 1.778 | 0.168 |
| 456 | Existing Customer | 47 | M | 3 | Uneducated | Married | $80K - $120K | Blue | 36 | 6 | 2 | 2 | 29770.000 | 1565 | 28205.000 | 0.674 | 1774 | 42 | 2.000 | 0.053 |
| 533 | Existing Customer | 59 | F | 0 | NaN | Married | Less than $40K | Blue | 53 | 4 | 3 | 2 | 5512.000 | 1124 | 4388.000 | 1.391 | 1975 | 49 | 1.579 | 0.204 |
| 697 | Existing Customer | 55 | M | 2 | High School | Married | $80K - $120K | Blue | 36 | 3 | 2 | 3 | 6514.000 | 2258 | 4256.000 | 0.785 | 1383 | 20 | 1.857 | 0.347 |
| 757 | Attrited Customer | 43 | M | 2 | Uneducated | Married | $60K - $80K | Blue | 24 | 2 | 3 | 3 | 14447.000 | 0 | 14447.000 | 0.856 | 915 | 29 | 2.222 | 0.000 |
| 760 | Existing Customer | 62 | F | 0 | NaN | Married | $40K - $60K | Blue | 36 | 4 | 2 | 2 | 2964.000 | 2400 | 564.000 | 1.296 | 1557 | 25 | 1.778 | 0.810 |
| 773 | Existing Customer | 61 | M | 0 | Post-Graduate | Married | abc | Blue | 53 | 6 | 2 | 3 | 14434.000 | 1927 | 12507.000 | 2.675 | 1731 | 32 | 3.571 | 0.134 |
| 805 | Existing Customer | 29 | M | 0 | Post-Graduate | Single | $40K - $60K | Blue | 19 | 4 | 1 | 2 | 13632.000 | 1482 | 12150.000 | 0.893 | 2168 | 42 | 2.500 | 0.109 |
| 968 | Existing Customer | 54 | M | 2 | Graduate | Married | $60K - $80K | Blue | 42 | 6 | 2 | 0 | 3583.000 | 0 | 3583.000 | 1.088 | 1065 | 28 | 1.545 | 0.000 |
| 1041 | Existing Customer | 36 | F | 3 | High School | Married | Less than $40K | Blue | 26 | 6 | 1 | 4 | 4331.000 | 1445 | 2886.000 | 1.070 | 2053 | 33 | 1.750 | 0.334 |
| 1095 | Existing Customer | 54 | F | 3 | Doctorate | Married | $40K - $60K | Blue | 39 | 3 | 1 | 3 | 2112.000 | 1594 | 518.000 | 1.204 | 1338 | 29 | 2.222 | 0.755 |
| 1187 | Existing Customer | 51 | F | 4 | Graduate | Single | abc | Blue | 34 | 3 | 1 | 2 | 11221.000 | 1279 | 9942.000 | 1.116 | 1564 | 45 | 1.647 | 0.114 |
| 1256 | Existing Customer | 43 | M | 2 | High School | Divorced | $80K - $120K | Blue | 33 | 4 | 3 | 2 | 3083.000 | 1898 | 1185.000 | 1.042 | 1448 | 30 | 2.000 | 0.616 |
| 1455 | Existing Customer | 39 | F | 2 | Doctorate | Married | abc | Blue | 36 | 5 | 2 | 4 | 8058.000 | 791 | 7267.000 | 1.787 | 2742 | 42 | 2.000 | 0.098 |
| 1778 | Existing Customer | 38 | F | 2 | Uneducated | Married | Less than $40K | Blue | 28 | 3 | 3 | 3 | 7049.000 | 0 | 7049.000 | 0.988 | 1475 | 27 | 1.700 | 0.000 |
| 2099 | Existing Customer | 65 | F | 1 | Graduate | Single | abc | Blue | 55 | 5 | 2 | 4 | 6463.000 | 1660 | 4803.000 | 0.995 | 2336 | 49 | 1.722 | 0.257 |
| 2358 | Existing Customer | 28 | M | 0 | High School | Married | abc | Blue | 21 | 5 | 3 | 5 | 25618.000 | 1209 | 24409.000 | 1.228 | 2157 | 49 | 1.882 | 0.047 |
| 2510 | Attrited Customer | 54 | M | 3 | High School | Married | $40K - $60K | Blue | 44 | 2 | 1 | 5 | 3032.000 | 0 | 3032.000 | 0.949 | 1037 | 14 | 2.500 | 0.000 |
| 2565 | Existing Customer | 39 | M | 3 | Graduate | Married | $120K + | Blue | 36 | 3 | 3 | 2 | 32964.000 | 2231 | 30733.000 | 1.731 | 3094 | 45 | 1.647 | 0.068 |
| 2683 | Existing Customer | 33 | F | 3 | Post-Graduate | Married | $40K - $60K | Blue | 28 | 5 | 2 | 3 | 1864.000 | 1081 | 783.000 | 1.416 | 2607 | 44 | 1.750 | 0.580 |
| 2696 | Existing Customer | 47 | F | 3 | Uneducated | Married | Less than $40K | Blue | 32 | 4 | 3 | 2 | 2821.000 | 951 | 1870.000 | 0.577 | 1452 | 38 | 1.923 | 0.337 |
| 2982 | Existing Customer | 45 | M | 4 | NaN | Married | $40K - $60K | Blue | 30 | 6 | 3 | 3 | 1892.000 | 1049 | 843.000 | 1.122 | 4558 | 60 | 1.609 | 0.554 |
| 3221 | Existing Customer | 35 | M | 3 | High School | NaN | $80K - $120K | Blue | 30 | 5 | 3 | 3 | 11229.000 | 1054 | 10175.000 | 0.552 | 1898 | 36 | 1.571 | 0.094 |
| 4086 | Existing Customer | 50 | M | 3 | College | Single | $40K - $60K | Blue | 36 | 3 | 2 | 2 | 3069.000 | 1881 | 1188.000 | 1.111 | 2909 | 58 | 1.522 | 0.613 |
| 4146 | Existing Customer | 48 | M | 4 | Doctorate | NaN | $80K - $120K | Blue | 40 | 3 | 2 | 2 | 3429.000 | 1268 | 2161.000 | 0.703 | 4422 | 64 | 1.783 | 0.370 |
| 4597 | Existing Customer | 39 | M | 2 | Graduate | Married | $60K - $80K | Blue | 23 | 5 | 2 | 1 | 4761.000 | 1561 | 3200.000 | 0.690 | 2045 | 53 | 1.650 | 0.328 |
| 7134 | Existing Customer | 46 | M | 3 | High School | Married | $60K - $80K | Blue | 36 | 4 | 3 | 1 | 4696.000 | 0 | 4696.000 | 0.732 | 4828 | 62 | 1.583 | 0.000 |
| 9977 | Attrited Customer | 41 | F | 3 | Uneducated | Married | Less than $40K | Blue | 30 | 1 | 2 | 3 | 9491.000 | 0 | 9491.000 | 1.056 | 9061 | 51 | 1.684 | 0.000 |
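If these extreme ratios are judged to be noise rather than signal, one common treatment is to cap (winsorize) the column at the chosen quantile instead of dropping rows. A minimal sketch, using a small stand-in for the `bk` DataFrame and `Series.clip`:

```python
# Hypothetical sketch: capping Total_Ct_Chng_Q4_Q1 at the 99.2th percentile.
# A tiny stand-in DataFrame is built here; in the notebook, `bk` would be used.
import pandas as pd

bk = pd.DataFrame({"Total_Ct_Chng_Q4_Q1": [0.5, 0.8, 1.0, 1.2, 2.5, 3.7]})

cap = bk["Total_Ct_Chng_Q4_Q1"].quantile(0.992)  # upper capping threshold
bk["Total_Ct_Chng_Q4_Q1"] = bk["Total_Ct_Chng_Q4_Q1"].clip(upper=cap)
# All values now lie at or below the threshold; row count is unchanged.
```

Capping keeps every customer in the training data while limiting the leverage of the handful of extreme values seen above.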
# Observation on Months_Inactive_12_mon
histogram_boxplot(bk, "Months_Inactive_12_mon")
# Observation on Contacts_Count_12_mon
histogram_boxplot(bk, "Contacts_Count_12_mon")
### Most customers contacted the bank about three times in the last 12 months.
# function to create labeled barplots
def labeled_barplot(data, feature, perc=False, n=None):
    """
    Barplot with count or percentage labels on top of each bar

    data: dataframe
    feature: dataframe column
    perc: whether to display percentages instead of counts (default is False)
    n: displays the top n category levels (default is None, i.e., display all levels)
    """
    total = len(data[feature])  # length of the column
    count = data[feature].nunique()
    if n is None:
        plt.figure(figsize=(count + 1, 5))
    else:
        plt.figure(figsize=(n + 1, 5))

    plt.xticks(rotation=90, fontsize=15)
    ax = sns.countplot(
        data=data,
        x=feature,
        palette="Paired",
        order=data[feature].value_counts().index[:n].sort_values(),
    )

    for p in ax.patches:
        if perc:
            label = "{:.1f}%".format(
                100 * p.get_height() / total
            )  # percentage of each class of the category
        else:
            label = p.get_height()  # count of each level of the category

        x = p.get_x() + p.get_width() / 2  # x-coordinate of the bar's center
        y = p.get_height()  # top of the bar
        ax.annotate(
            label,
            (x, y),
            ha="center",
            va="center",
            size=12,
            xytext=(0, 5),
            textcoords="offset points",
        )  # place the label just above the bar

    plt.show()  # show the plot
labeled_barplot(bk, "Attrition_Flag")
labeled_barplot(bk, "Gender")
labeled_barplot(bk, "Dependent_count")
labeled_barplot(bk, "Education_Level")
labeled_barplot(bk, "Marital_Status")
# We have come across an issue with the Income_Category variable;
# the value 'abc' will have to be replaced. We will treat it as missing for now.
labeled_barplot(bk, "Income_Category")
# looking at rows with 'abc'
bk.loc[bk["Income_Category"] == "abc"]
| | Attrition_Flag | Customer_Age | Gender | Dependent_count | Education_Level | Marital_Status | Income_Category | Card_Category | Months_on_book | Total_Relationship_Count | Months_Inactive_12_mon | Contacts_Count_12_mon | Credit_Limit | Total_Revolving_Bal | Avg_Open_To_Buy | Total_Amt_Chng_Q4_Q1 | Total_Trans_Amt | Total_Trans_Ct | Total_Ct_Chng_Q4_Q1 | Avg_Utilization_Ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 19 | Existing Customer | 45 | F | 2 | Graduate | Married | abc | Blue | 37 | 6 | 1 | 2 | 14470.000 | 1157 | 13313.000 | 0.966 | 1207 | 21 | 0.909 | 0.080 |
| 28 | Existing Customer | 44 | F | 3 | Uneducated | Single | abc | Blue | 34 | 5 | 2 | 2 | 10100.000 | 0 | 10100.000 | 0.525 | 1052 | 18 | 1.571 | 0.000 |
| 39 | Attrited Customer | 66 | F | 0 | Doctorate | Married | abc | Blue | 56 | 5 | 4 | 3 | 7882.000 | 605 | 7277.000 | 1.052 | 704 | 16 | 0.143 | 0.077 |
| 44 | Existing Customer | 38 | F | 4 | Graduate | Single | abc | Blue | 28 | 2 | 3 | 3 | 9830.000 | 2055 | 7775.000 | 0.977 | 1042 | 23 | 0.917 | 0.209 |
| 58 | Existing Customer | 44 | F | 5 | Graduate | Married | abc | Blue | 35 | 4 | 1 | 2 | 6273.000 | 978 | 5295.000 | 2.275 | 1359 | 25 | 1.083 | 0.156 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 10021 | Attrited Customer | 30 | F | 1 | Graduate | Married | abc | Blue | 18 | 4 | 1 | 4 | 4377.000 | 2517 | 1860.000 | 0.941 | 8759 | 74 | 0.609 | 0.575 |
| 10040 | Attrited Customer | 50 | F | 3 | Doctorate | Single | abc | Blue | 36 | 4 | 3 | 3 | 5173.000 | 0 | 5173.000 | 0.912 | 8757 | 68 | 0.789 | 0.000 |
| 10083 | Existing Customer | 42 | F | 4 | Uneducated | Married | abc | Blue | 23 | 4 | 1 | 2 | 8348.000 | 0 | 8348.000 | 0.695 | 15905 | 111 | 0.708 | 0.000 |
| 10092 | Attrited Customer | 40 | F | 3 | Graduate | Married | abc | Blue | 25 | 1 | 2 | 3 | 6888.000 | 1878 | 5010.000 | 1.059 | 9038 | 64 | 0.829 | 0.273 |
| 10119 | Attrited Customer | 55 | F | 3 | Uneducated | Single | abc | Blue | 47 | 4 | 3 | 3 | 14657.000 | 2517 | 12140.000 | 0.166 | 6009 | 53 | 0.514 | 0.172 |
1111 rows × 20 columns
# Replacing 'abc' with NaN. We will deal with these missing values later
bk["Income_Category"].replace("abc", np.nan, inplace=True)
# Verifying 'abc' is now a missing value
labeled_barplot(bk, "Income_Category")
# Checking the count of missing values per column
bk.isnull().sum().sort_values(ascending=False)
Education_Level             1518
Income_Category             1111
Marital_Status               749
Attrition_Flag                 0
Credit_Limit                   0
Total_Ct_Chng_Q4_Q1            0
Total_Trans_Ct                 0
Total_Trans_Amt                0
Total_Amt_Chng_Q4_Q1           0
Avg_Open_To_Buy                0
Total_Revolving_Bal            0
Months_Inactive_12_mon         0
Contacts_Count_12_mon          0
Customer_Age                   0
Total_Relationship_Count       0
Months_on_book                 0
Card_Category                  0
Dependent_count                0
Gender                         0
Avg_Utilization_Ratio          0
dtype: int64
labeled_barplot(bk, "Card_Category")
labeled_barplot(bk, "Total_Relationship_Count")
labeled_barplot(bk, "Months_Inactive_12_mon")
labeled_barplot(bk, "Contacts_Count_12_mon")
sns.pairplot(bk, hue="Attrition_Flag")
<seaborn.axisgrid.PairGrid at 0x22474c93490>
sns.set(rc={"figure.figsize": (10, 7)})
sns.boxplot(y="Total_Trans_Amt", x="Marital_Status", data=bk, orient="vertical")
<AxesSubplot:xlabel='Marital_Status', ylabel='Total_Trans_Amt'>
sns.boxplot(y="Total_Trans_Amt", x="Attrition_Flag", data=bk, orient="vertical")
<AxesSubplot:xlabel='Attrition_Flag', ylabel='Total_Trans_Amt'>
sns.boxplot(y="Total_Trans_Amt", x="Education_Level", data=bk, orient="vertical")
<AxesSubplot:xlabel='Education_Level', ylabel='Total_Trans_Amt'>
sns.boxplot(y="Total_Trans_Amt", x="Income_Category", data=bk, orient="vertical")
<AxesSubplot:xlabel='Income_Category', ylabel='Total_Trans_Amt'>
sns.boxplot(y="Total_Trans_Amt", x="Card_Category", data=bk, orient="vertical")
<AxesSubplot:xlabel='Card_Category', ylabel='Total_Trans_Amt'>
sns.boxplot(
y="Total_Trans_Amt", x="Total_Relationship_Count", data=bk, orient="vertical"
)
<AxesSubplot:xlabel='Total_Relationship_Count', ylabel='Total_Trans_Amt'>
pd.pivot_table(
data=bk, index=["Customer_Age"], values="Total_Trans_Amt", aggfunc=np.sum,
).plot(kind="line", marker="o", linewidth=2)
<AxesSubplot:xlabel='Customer_Age'>
pd.pivot_table(
data=bk, index=["Dependent_count"], values="Total_Trans_Amt", aggfunc=np.sum,
).plot(kind="line", marker="o", linewidth=2)
<AxesSubplot:xlabel='Dependent_count'>
pd.pivot_table(
data=bk, index=["Customer_Age"], values="Credit_Limit", aggfunc=np.sum,
).plot(kind="line", marker="o", linewidth=2)
<AxesSubplot:xlabel='Customer_Age'>
pd.pivot_table(
data=bk, index=["Dependent_count"], values="Credit_Limit", aggfunc=np.sum,
).plot(kind="line", marker="o", linewidth=2)
<AxesSubplot:xlabel='Dependent_count'>
cols = bk[
[
"Months_on_book",
"Total_Relationship_Count",
"Months_Inactive_12_mon",
"Contacts_Count_12_mon",
"Credit_Limit",
"Total_Revolving_Bal",
"Avg_Open_To_Buy",
"Total_Trans_Amt",
"Total_Trans_Ct",
"Avg_Utilization_Ratio",
]
].columns.tolist()
plt.figure(figsize=(25, 25))
for i, variable in enumerate(cols):
    plt.subplot(3, 4, i + 1)
    sns.boxplot(x=bk["Attrition_Flag"], y=bk[variable])
    plt.tight_layout()
    plt.title(variable)
plt.show()
• Customers with more Months on Book are more likely to keep using Credit Card services.
• Customers that hold more products with the bank are more likely to keep their Credit Cards; the bank could advertise additional products to customers that hold only one or two products with the bank.
• Customers that spend more per transaction are more likely to keep using Credit Card services.
• Customers that perform more transactions are more likely to keep using Credit Card services.
# function to plot stacked bar chart
def stacked_barplot(data, predictor, target):
    """
    Print the category counts and plot a stacked bar chart
    data: dataframe
    predictor: independent variable
    target: target variable
    """
    count = data[predictor].nunique()
    sorter = data[target].value_counts().index[-1]
    tab1 = pd.crosstab(data[predictor], data[target], margins=True).sort_values(
        by=sorter, ascending=False
    )
    print(tab1)
    print("-" * 120)
    tab = pd.crosstab(data[predictor], data[target], normalize="index").sort_values(
        by=sorter, ascending=False
    )
    tab.plot(kind="bar", stacked=True, figsize=(count + 1, 5))
    plt.legend(loc="upper left", bbox_to_anchor=(1, 1), frameon=False)
    plt.show()
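The `normalize="index"` argument used in `stacked_barplot` is what turns raw counts into the within-category proportions drawn in the stacked bars; a minimal sketch on a hypothetical toy frame:

```python
import pandas as pd

# Toy frame: three customers in group "A", one in group "B"
toy = pd.DataFrame(
    {"group": ["A", "A", "A", "B"], "outcome": ["yes", "yes", "no", "no"]}
)

# Raw counts per (group, outcome) pair
counts = pd.crosstab(toy["group"], toy["outcome"])

# normalize="index" divides each row by its row total,
# so each row becomes within-group proportions summing to 1
props = pd.crosstab(toy["group"], toy["outcome"], normalize="index")
```

Because every row of `props` sums to 1, categories of very different sizes (e.g. Blue vs Platinum cards) can be compared on the same scale.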
stacked_barplot(bk, "Education_Level", "Attrition_Flag")
Attrition_Flag   Attrited Customer  Existing Customer   All
Education_Level
All                           1371               7234  8605
Graduate                       487               2641  3128
High School                    306               1704  2010
Uneducated                     237               1250  1487
College                        154                859  1013
Doctorate                       95                356   451
Post-Graduate                   92                424   516
------------------------------------------------------------------------------------------------------------------------
• We can see a negative relationship between Education Level and the likelihood of remaining a Credit Card holder: the higher the education level, the more likely the customer is to attrite.
stacked_barplot(bk, "Income_Category", "Attrition_Flag")
Attrition_Flag   Attrited Customer  Existing Customer   All
Income_Category
All                           1440               7572  9012
Less than $40K                 612               2948  3560
$40K - $60K                    271               1518  1789
$80K - $120K                   242               1293  1535
$60K - $80K                    189               1213  1402
$120K +                        126                600   726
------------------------------------------------------------------------------------------------------------------------
• We also see a negative relationship between Income Category and the likelihood of remaining a Credit Card holder: the higher the income, the less likely, though the variation isn't significant. Based on the last two bar charts, the bank should offer rewards and other Credit Card incentives to retain high-earning customers.
stacked_barplot(bk, "Card_Category", "Attrition_Flag")
Attrition_Flag  Attrited Customer  Existing Customer    All
Card_Category
All                          1627               8496  10123
Blue                         1519               7913   9432
Silver                         82                473    555
Gold                           21                 95    116
Platinum                        5                 15     20
------------------------------------------------------------------------------------------------------------------------
• Silver and Blue card holders utilize credit card services more so than Gold and Platinum customers.
stacked_barplot(bk, "Marital_Status", "Attrition_Flag")
Attrition_Flag  Attrited Customer  Existing Customer   All
Marital_Status
All                          1498               7876  9374
Married                       709               3974  4683
Single                        668               3275  3943
Divorced                      121                627   748
------------------------------------------------------------------------------------------------------------------------
• Single customers are slightly less likely to remain card holders than Divorced and Married customers, though the variation here is very small.
stacked_barplot(bk, "Months_Inactive_12_mon", "Attrition_Flag")
Attrition_Flag          Attrited Customer  Existing Customer    All
Months_Inactive_12_mon
All                                  1627               8496  10123
3                                     826               3019   3845
2                                     505               2775   3280
4                                     130                304    434
1                                     100               2133   2233
5                                      32                146    178
6                                      19                105    124
0                                      15                 14     29
------------------------------------------------------------------------------------------------------------------------
• This graph shows that attrition rises with the number of inactive months: customers inactive for three or four of the last twelve months leave at a much higher rate than those inactive for only one (the tiny zero-month group is an outlier). One logical explanation for this disengagement could be poor customer service; another could be dissatisfaction with some aspect of the Credit Card terms and conditions. Management needs to drill down into this and find a root cause.
stacked_barplot(bk, "Contacts_Count_12_mon", "Attrition_Flag")
Attrition_Flag         Attrited Customer  Existing Customer    All
Contacts_Count_12_mon
All                                 1627               8496  10123
3                                    681               2697   3378
2                                    403               2823   3226
4                                    315               1077   1392
1                                    108               1390   1498
5                                     59                117    176
6                                     54                  0     54
0                                      7                392    399
------------------------------------------------------------------------------------------------------------------------
• Surprisingly, this graph shows that attrition climbs with the number of contacts with the bank; every one of the 54 customers contacted six times left Credit Card Services. The Customer Service Department needs to be retrained on how to interact with customers.
stacked_barplot(bk, "Total_Relationship_Count", "Attrition_Flag")
Attrition_Flag            Attrited Customer  Existing Customer    All
Total_Relationship_Count
All                                    1627               8496  10123
3                                       400               1904   2304
2                                       346                896   1242
1                                       233                676    909
5                                       227               1663   1890
4                                       225               1687   1912
6                                       196               1670   1866
------------------------------------------------------------------------------------------------------------------------
• Customers that own more products with the bank are more likely to be card holders.
stacked_barplot(bk, "Dependent_count", "Attrition_Flag")
Attrition_Flag   Attrited Customer  Existing Customer    All
Dependent_count
All                           1627               8496  10123
3                              482               2249   2731
2                              417               2238   2655
1                              269               1568   1837
4                              260               1314   1574
0                              135                767    902
5                               64                360    424
------------------------------------------------------------------------------------------------------------------------
• Dependent Count doesn't show any significant variation in attrition; the proportions are very close across all values.
stacked_barplot(bk, "Gender", "Attrition_Flag")
Attrition_Flag  Attrited Customer  Existing Customer    All
Gender
All                          1627               8496  10123
F                             930               4427   5357
M                             697               4069   4766
------------------------------------------------------------------------------------------------------------------------
• Male customers are slightly more likely to remain Credit Card holders.
plt.figure(figsize=(15, 7))
sns.heatmap(bk.corr(), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral")
plt.show()
• Customer Age and Months on Book have a positive correlation. This is expected; the older people get, the less likely they are to go through the hassle of changing banks.
• Credit Limit and Average Open To Buy have a strong positive correlation. This is unsurprising, since open to buy is essentially the credit limit minus the revolving balance.
• Total Transaction Amount and Total Transaction Count have a positive correlation. Customers who make more transactions naturally spend more in total; they likely use their card for small and major purchases alike.
• Average Utilization Ratio and Credit Limit are negatively correlated. Customers that keep a revolving balance of more than 30 percent of their limit are less likely to receive credit line increases.
• Average Utilization Ratio and Average Open To Buy are negatively correlated.
• Total Revolving Balance and Average Utilization Ratio are positively correlated.
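Several of these correlations follow directly from how the columns are (presumably) derived; a quick arithmetic sketch, assuming the standard definitions of open to buy and utilization:

```python
# Assumed definitions (not confirmed in the data dictionary):
#   Avg_Open_To_Buy       = Credit_Limit - Total_Revolving_Bal
#   Avg_Utilization_Ratio = Total_Revolving_Bal / Credit_Limit
credit_limit, revolving_bal = 10000.0, 2500.0

open_to_buy = credit_limit - revolving_bal   # 7500.0
utilization = revolving_bal / credit_limit   # 0.25

# Holding the limit fixed, a larger revolving balance pushes utilization up
# and open to buy down -- hence the negative correlation between the two.
```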
attrition_map = {"Existing Customer": 1, "Attrited Customer": 0}
bk["Attrition_Flag"] = bk["Attrition_Flag"].map(attrition_map)
bk.head(15)
| | Attrition_Flag | Customer_Age | Gender | Dependent_count | Education_Level | Marital_Status | Income_Category | Card_Category | Months_on_book | Total_Relationship_Count | Months_Inactive_12_mon | Contacts_Count_12_mon | Credit_Limit | Total_Revolving_Bal | Avg_Open_To_Buy | Total_Amt_Chng_Q4_Q1 | Total_Trans_Amt | Total_Trans_Ct | Total_Ct_Chng_Q4_Q1 | Avg_Utilization_Ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 45 | M | 3 | High School | Married | $60K - $80K | Blue | 39 | 5 | 1 | 3 | 12691.000 | 777 | 11914.000 | 1.335 | 1144 | 42 | 1.625 | 0.061 |
| 1 | 1 | 49 | F | 5 | Graduate | Single | Less than $40K | Blue | 44 | 6 | 1 | 2 | 8256.000 | 864 | 7392.000 | 1.541 | 1291 | 33 | 3.714 | 0.105 |
| 2 | 1 | 51 | M | 3 | Graduate | Married | $80K - $120K | Blue | 36 | 4 | 1 | 0 | 3418.000 | 0 | 3418.000 | 2.594 | 1887 | 20 | 2.333 | 0.000 |
| 3 | 1 | 40 | F | 4 | High School | NaN | Less than $40K | Blue | 34 | 3 | 4 | 1 | 3313.000 | 2517 | 796.000 | 1.405 | 1171 | 20 | 2.333 | 0.760 |
| 4 | 1 | 40 | M | 3 | Uneducated | Married | $60K - $80K | Blue | 21 | 5 | 1 | 0 | 4716.000 | 0 | 4716.000 | 2.175 | 816 | 28 | 2.500 | 0.000 |
| 5 | 1 | 44 | M | 2 | Graduate | Married | $40K - $60K | Blue | 36 | 3 | 1 | 2 | 4010.000 | 1247 | 2763.000 | 1.376 | 1088 | 24 | 0.846 | 0.311 |
| 6 | 1 | 51 | M | 4 | NaN | Married | $120K + | Gold | 46 | 6 | 1 | 3 | 34516.000 | 2264 | 32252.000 | 1.975 | 1330 | 31 | 0.722 | 0.066 |
| 7 | 1 | 32 | M | 0 | High School | NaN | $60K - $80K | Silver | 27 | 2 | 2 | 2 | 29081.000 | 1396 | 27685.000 | 2.204 | 1538 | 36 | 0.714 | 0.048 |
| 8 | 1 | 37 | M | 3 | Uneducated | Single | $60K - $80K | Blue | 36 | 5 | 2 | 0 | 22352.000 | 2517 | 19835.000 | 3.355 | 1350 | 24 | 1.182 | 0.113 |
| 9 | 1 | 48 | M | 2 | Graduate | Single | $80K - $120K | Blue | 36 | 6 | 3 | 3 | 11656.000 | 1677 | 9979.000 | 1.524 | 1441 | 32 | 0.882 | 0.144 |
| 10 | 1 | 42 | M | 5 | Uneducated | NaN | $120K + | Blue | 31 | 5 | 3 | 2 | 6748.000 | 1467 | 5281.000 | 0.831 | 1201 | 42 | 0.680 | 0.217 |
| 11 | 1 | 65 | M | 1 | NaN | Married | $40K - $60K | Blue | 54 | 6 | 2 | 3 | 9095.000 | 1587 | 7508.000 | 1.433 | 1314 | 26 | 1.364 | 0.174 |
| 12 | 1 | 56 | M | 1 | College | Single | $80K - $120K | Blue | 36 | 3 | 6 | 0 | 11751.000 | 0 | 11751.000 | 3.397 | 1539 | 17 | 3.250 | 0.000 |
| 13 | 1 | 35 | M | 3 | Graduate | NaN | $60K - $80K | Blue | 30 | 5 | 1 | 3 | 8547.000 | 1666 | 6881.000 | 1.163 | 1311 | 33 | 2.000 | 0.195 |
| 14 | 1 | 57 | F | 2 | Graduate | Married | Less than $40K | Blue | 48 | 5 | 2 | 2 | 2436.000 | 680 | 1756.000 | 1.190 | 1570 | 29 | 0.611 | 0.279 |
# Separating target variable and other variables
X = bk.drop(columns="Attrition_Flag")
X = pd.get_dummies(X)
Y = bk["Attrition_Flag"]
# Splitting data into training, validation and test set:
# first we split data into 2 parts, say temporary and test
X_temp, X_test, y_temp, y_test = train_test_split(
X, Y, test_size=0.2, random_state=1, stratify=Y
)
# then we split the temporary set into train and validation
X_train, X_val, y_train, y_val = train_test_split(
X_temp, y_temp, test_size=0.25, random_state=1, stratify=y_temp
)
print(X_train.shape, X_val.shape, X_test.shape)
(6073, 34) (2025, 34) (2025, 34)
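The two-step split yields a 60/20/20 train/validation/test breakdown. A quick check of the arithmetic (scikit-learn's `train_test_split` rounds the test count up):

```python
import math

n = 10123  # total rows in the dataset

n_test = math.ceil(0.20 * n)       # 20% held-out test set
n_temp = n - n_test                # remaining 80% ("temporary" set)
n_val = math.ceil(0.25 * n_temp)   # 25% of the remainder = 20% overall
n_train = n_temp - n_val           # 60% overall

print(n_train, n_val, n_test)  # 6073 2025 2025
```

These counts match the shapes printed above.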
• We will impute missing values using the most frequent values.
imputer = SimpleImputer(strategy="most_frequent")
# fit on the training set only, then apply the same imputation everywhere
X_train = imputer.fit_transform(X_train)
X_val = imputer.transform(X_val)
X_test = imputer.transform(X_test)
print(X_train.shape, X_val.shape, X_test.shape)
(6073, 34) (2025, 34) (2025, 34)
# scaling data for better defined models
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_val = scaler.transform(X_val)
X_test = scaler.transform(X_test)
• Predicting that a customer will not utilize Credit Card services when in fact they will loses a potential source of income for the company, because the marketing team will not target a customer who should be targeted.
The bank wants recall to be maximized; to achieve this we need to reduce the number of false negatives.
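Recall measures the share of actual positives the model catches, so driving down false negatives drives recall up; a minimal illustration of the definition:

```python
def recall(tp, fn):
    # Recall = TP / (TP + FN): fraction of actual positives identified
    return tp / (tp + fn)

# Two hypothetical models scoring the same 100 true card users:
# model A misses 20 of them, model B misses only 5
recall_a = recall(tp=80, fn=20)  # 0.80
recall_b = recall(tp=95, fn=5)   # 0.95

# Fewer false negatives means fewer would-be card users that the
# marketing team fails to target.
```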
models = [] # Empty list to store all the models
# Appending models into the list
models.append(("Bagging", BaggingClassifier(random_state=1)))
models.append(("Random forest", RandomForestClassifier(random_state=1)))
models.append(("GBM", GradientBoostingClassifier(random_state=1)))
models.append(("Adaboost", AdaBoostClassifier(random_state=1)))
models.append(("Xgboost", XGBClassifier(random_state=1, eval_metric="logloss")))
models.append(("dtree", DecisionTreeClassifier(random_state=1)))
results = [] # Empty list to store all model's CV scores
names = [] # Empty list to store name of the models
score = []
# loop through all models to get the mean cross validated score
print("\nCross-Validation Performance:\n")
for name, model in models:
    scoring = "recall"
    kfold = StratifiedKFold(
        n_splits=5, shuffle=True, random_state=1
    )  # Setting number of splits equal to 5
    cv_result = cross_val_score(
        estimator=model, X=X_train, y=y_train, scoring=scoring, cv=kfold
    )
    results.append(cv_result)
    names.append(name)
    print("{}: {}".format(name, cv_result.mean() * 100))

print("\nValidation Performance:\n")
for name, model in models:
    model.fit(X_train, y_train)
    scores = recall_score(y_val, model.predict(X_val))
    score.append(scores)
    print("{}: {}".format(name, scores))
Cross-Validation Performance:

Bagging: 97.15540033481498
Random forest: 98.6658392503223
GBM: 98.74436683407417
Adaboost: 98.11658873559237
Xgboost: 98.744463045277
dtree: 96.07598760799706

Validation Performance:

Bagging: 0.9711595055915244
Random forest: 0.9882283696291937
GBM: 0.9864626250735727
Adaboost: 0.9799882283696292
Xgboost: 0.9888169511477339
dtree: 0.9664508534432019
# Plotting boxplots for CV scores of all models defined above
fig = plt.figure(figsize=(10, 7))
fig.suptitle("Algorithm Comparison")
ax = fig.add_subplot(111)
plt.boxplot(results)
ax.set_xticklabels(names)
plt.show()
• Xgboost has the highest Cross-Validation recall score, followed by Gradient Boost and then Random Forest. These models generalized well on the Validation set too.
print("Before UpSampling, counts of label 'Yes': {}".format(sum(y_train == 1)))
print("Before UpSampling, counts of label 'No': {} \n".format(sum(y_train == 0)))
sm = SMOTE(
sampling_strategy=1, k_neighbors=5, random_state=1
) # Synthetic Minority Over Sampling Technique
X_train_over, y_train_over = sm.fit_resample(X_train, y_train)
print("After UpSampling, counts of label 'Yes': {}".format(sum(y_train_over == 1)))
print("After UpSampling, counts of label 'No': {} \n".format(sum(y_train_over == 0)))
print("After UpSampling, the shape of train_X: {}".format(X_train_over.shape))
print("After UpSampling, the shape of train_y: {} \n".format(y_train_over.shape))
Before UpSampling, counts of label 'Yes': 5097
Before UpSampling, counts of label 'No': 976

After UpSampling, counts of label 'Yes': 5097
After UpSampling, counts of label 'No': 5097

After UpSampling, the shape of train_X: (10194, 34)
After UpSampling, the shape of train_y: (10194,)
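`sampling_strategy=1` asks SMOTE for a 1:1 minority-to-majority ratio, which accounts for the counts printed above; the arithmetic as a quick sketch:

```python
n_majority, n_minority = 5097, 976  # training counts before resampling

# SMOTE with sampling_strategy=1 synthesizes minority-class rows until
# minority == majority; only the minority class is grown
n_synthetic = n_majority - n_minority  # 4121 synthetic rows added
total_after = n_majority * 2           # 10194 rows in X_train_over
```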
bag_up = [] # Empty list to store all the models
# Appending models into the list
bag_up.append(("Bagging", BaggingClassifier(random_state=1)))
bag_up.append(("Random forest", RandomForestClassifier(random_state=1)))
bag_up.append(("GBM", GradientBoostingClassifier(random_state=1)))
bag_up.append(("Adaboost", AdaBoostClassifier(random_state=1)))
bag_up.append(("Xgboost", XGBClassifier(random_state=1, eval_metric="logloss")))
bag_up.append(("dtree", DecisionTreeClassifier(random_state=1)))
results = [] # Empty list to store all model's CV scores
names = [] # Empty list to store name of the models
score = []
# loop through all models to get the mean cross validated score
print("\nCross-Validation Performance:\n")
for name, model in bag_up:
    scoring = "recall"
    kfold = StratifiedKFold(
        n_splits=5, shuffle=True, random_state=1
    )  # Setting number of splits equal to 5
    cv_result_over = cross_val_score(
        estimator=model, X=X_train_over, y=y_train_over, scoring=scoring, cv=kfold
    )
    results.append(cv_result_over)
    names.append(name)
    print("{}: {}".format(name, cv_result_over.mean() * 100))

print("\nValidation Performance:\n")
for name, model in bag_up:
    model.fit(X_train_over, y_train_over)
    scores = recall_score(y_val, model.predict(X_val))
    score.append(scores)
    print("{}: {}".format(name, scores))
Cross-Validation Performance:

Bagging: 94.44784390694453
Random forest: 96.88042871711981
GBM: 96.54715311050819
Adaboost: 94.85985876195426
Xgboost: 98.234408974581
dtree: 94.0749292847659

Validation Performance:

Bagging: 0.951736315479694
Random forest: 0.9699823425544438
GBM: 0.9699823425544438
Adaboost: 0.9446733372572101
Xgboost: 0.9858740435550324
dtree: 0.9405532666274279
# Plotting boxplots for CV scores of all models defined above
fig = plt.figure()
fig.suptitle("Algorithm Comparison")
ax = fig.add_subplot(111)
plt.boxplot(results)
ax.set_xticklabels(names)
plt.show()
• As with the first set of models we built, Xgboost has the highest recall score when using up-sampling, followed by Random Forest and Gradient Boost.
• These models also generalized well on the Validation set.
rus = RandomUnderSampler(random_state=1)
X_train_un, y_train_un = rus.fit_resample(X_train, y_train)
print("Before Under Sampling, counts of label 'Yes': {}".format(sum(y_train == 1)))
print("Before Under Sampling, counts of label 'No': {} \n".format(sum(y_train == 0)))
print("After Under Sampling, counts of label 'Yes': {}".format(sum(y_train_un == 1)))
print("After Under Sampling, counts of label 'No': {} \n".format(sum(y_train_un == 0)))
print("After Under Sampling, the shape of train_X: {}".format(X_train_un.shape))
print("After Under Sampling, the shape of train_y: {} \n".format(y_train_un.shape))
Before Under Sampling, counts of label 'Yes': 5097
Before Under Sampling, counts of label 'No': 976

After Under Sampling, counts of label 'Yes': 976
After Under Sampling, counts of label 'No': 976

After Under Sampling, the shape of train_X: (1952, 34)
After Under Sampling, the shape of train_y: (1952,)
bag_dn = [] # Empty list to store all the models
# Appending models into the list
bag_dn.append(("Bagging", BaggingClassifier(random_state=1)))
bag_dn.append(("Random forest", RandomForestClassifier(random_state=1)))
bag_dn.append(("GBM", GradientBoostingClassifier(random_state=1)))
bag_dn.append(("Adaboost", AdaBoostClassifier(random_state=1)))
bag_dn.append(("Xgboost", XGBClassifier(random_state=1, eval_metric="logloss")))
bag_dn.append(("dtree", DecisionTreeClassifier(random_state=1)))
results = [] # Empty list to store all model's CV scores
names = [] # Empty list to store name of the models
score = []
# loop through all models to get the mean cross validated score
print("\nCross-Validation Performance:\n")
for name, model in bag_dn:
    scoring = "recall"
    kfold = StratifiedKFold(
        n_splits=5, shuffle=True, random_state=1
    )  # Setting number of splits equal to 5
    cv_result_under = cross_val_score(
        estimator=model, X=X_train_un, y=y_train_un, scoring=scoring, cv=kfold
    )
    results.append(cv_result_under)
    names.append(name)
    print("{}: {}".format(name, cv_result_under.mean() * 100))

print("\nValidation Performance:\n")
for name, model in bag_dn:
    model.fit(X_train_un, y_train_un)
    scores = recall_score(y_val, model.predict(X_val))
    score.append(scores)
    print("{}: {}".format(name, scores))
Cross-Validation Performance:

Bagging: 89.85714285714288
Random forest: 92.62323390894821
GBM: 93.64730507587652
Adaboost: 92.62271062271064
Xgboost: 94.46781789638933
dtree: 89.34432234432232

Validation Performance:

Bagging: 0.901706886403767
Random forest: 0.9258387286639199
GBM: 0.9405532666274279
Adaboost: 0.924072984108299
Xgboost: 0.9429075927015892
dtree: 0.8993525603296056
# Plotting boxplots for CV scores of all models defined above
fig = plt.figure()
fig.suptitle("Algorithm Comparison")
ax = fig.add_subplot(111)
plt.boxplot(results)
ax.set_xticklabels(names)
plt.show()
• Xgboost again shows the highest recall score with down-sampling, followed closely by Gradient Boost and Random Forest.
• This model also generalized well on the Validation set.
# defining a function to compute different metrics to check performance of a classification model built using sklearn
def model_performance_classification_sklearn(model, predictors, target):
    """
    Function to compute different metrics to check classification model performance
    model: classifier
    predictors: independent variables
    target: dependent variable
    """
    # predicting using the independent variables
    pred = model.predict(predictors)
    acc = accuracy_score(target, pred)  # to compute Accuracy
    recall = recall_score(target, pred)  # to compute Recall
    precision = precision_score(target, pred)  # to compute Precision
    f1 = f1_score(target, pred)  # to compute F1-score
    # creating a dataframe of metrics
    df_perf = pd.DataFrame(
        {"Accuracy": acc, "Recall": recall, "Precision": precision, "F1": f1},
        index=[0],
    )
    return df_perf
def confusion_matrix_sklearn(model, predictors, target):
    """
    To plot the confusion_matrix with percentages
    model: classifier
    predictors: independent variables
    target: dependent variable
    """
    y_pred = model.predict(predictors)
    cm = confusion_matrix(target, y_pred)
    labels = np.asarray(
        [
            ["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
            for item in cm.flatten()
        ]
    ).reshape(2, 2)
    plt.figure(figsize=(6, 4))
    sns.heatmap(cm, annot=labels, fmt="")
    plt.ylabel("True label")
    plt.xlabel("Predicted label")
%%time
# defining model
model = RandomForestClassifier(random_state=1)
# Parameter grid to pass in GridSearchCV
param_grid = {'max_depth': np.arange(2,30),
'min_samples_leaf': [1, 2, 5, 7, 10],
'max_leaf_nodes' : [2, 3, 5, 10,15],
'min_impurity_decrease': [0.0001,0.001,0.01,0.1]
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
# Calling GridSearchCV
grid_cv = GridSearchCV(estimator=model, param_grid=param_grid, scoring=scorer, cv=5, n_jobs = -1)
# Fitting parameters in GridSeachCV
grid_cv.fit(X_train, y_train)
print(
"Best Parameters:{} \nScore: {}".format(grid_cv.best_params_, grid_cv.best_score_)
)
Best Parameters:{'max_depth': 2, 'max_leaf_nodes': 2, 'min_impurity_decrease': 0.0001, 'min_samples_leaf': 1}
Score: 1.0
Wall time: 20min 7s
# building model with best parameters
RF_tuned_grid = RandomForestClassifier(
min_samples_leaf=1,
min_impurity_decrease=0.0001,
max_leaf_nodes=2,
max_depth=2,
random_state=1,
)
# Fit the model on training data
RF_tuned_grid.fit(X_train, y_train)
RandomForestClassifier(max_depth=2, max_leaf_nodes=2,
min_impurity_decrease=0.0001, random_state=1)
# Calculating different metrics on train set
RandomForest_grid_train = model_performance_classification_sklearn(
RF_tuned_grid, X_train, y_train
)
print("Training performance:")
RandomForest_grid_train
Training performance:
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.839 | 1.000 | 0.839 | 0.913 |
# Calculating different metrics on validation set
RandomForest_grid_val = model_performance_classification_sklearn(
RF_tuned_grid, X_val, y_val
)
print("Validation performance:")
RandomForest_grid_val
Validation performance:
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.839 | 1.000 | 0.839 | 0.912 |
# creating confusion matrix
confusion_matrix_sklearn(RF_tuned_grid, X_val, y_val)
• The Random Forest generalized well on the Training and Validation sets with Grid Search tuning. Note, however, that a recall of 1.000 with precision equal to the positive-class share (0.839) suggests the tuned model is predicting every customer as an existing customer; optimizing recall alone can favor such a degenerate solution.
%%time
# defining model
model = RandomForestClassifier(random_state=1)
# Parameter grid to pass in GridSearchCV
param_grid = {'max_depth': np.arange(2,30),
'min_samples_leaf': [1, 2, 5, 7, 10],
'max_leaf_nodes' : [2, 3, 5, 10,15],
'min_impurity_decrease': [0.0001,0.001,0.01,0.1]
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=model, param_distributions=param_grid, n_jobs = -1, n_iter=50, scoring=scorer, cv=5, random_state=1)
#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train,y_train)
print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Best parameters are {'min_samples_leaf': 1, 'min_impurity_decrease': 0.01, 'max_leaf_nodes': 10, 'max_depth': 12} with CV score=1.0:
Wall time: 20.4 s
# building model with best parameters
RF_tuned_rand = RandomForestClassifier(
min_samples_leaf=1,
min_impurity_decrease=0.01,
max_leaf_nodes=10,
max_depth=12,
random_state=1,
)
# Fit the model on training data
RF_tuned_rand.fit(X_train, y_train)
RandomForestClassifier(max_depth=12, max_leaf_nodes=10,
min_impurity_decrease=0.01, random_state=1)
# Calculating different metrics on train set
RandomForest_random_train = model_performance_classification_sklearn(
RF_tuned_rand, X_train, y_train
)
print("Training performance:")
RandomForest_random_train
Training performance:
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.839 | 1.000 | 0.839 | 0.913 |
# Calculating different metrics on validation set
RandomForest_random_val = model_performance_classification_sklearn(
RF_tuned_rand, X_val, y_val
)
print("Validation performance:")
RandomForest_random_val
Validation performance:
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.839 | 1.000 | 0.839 | 0.912 |
# creating confusion matrix
confusion_matrix_sklearn(RF_tuned_rand, X_val, y_val)
• The Random Forest also generalized well with Random Search tuning.
• The scores are very close to those of the Grid Search.
%%time
#defining model
model = GradientBoostingClassifier(random_state=1)
#Parameter grid to pass in GridSearchCV
param_grid={
"min_samples_split": [2, 3, 5, 10,15],
"min_samples_leaf": [1, 2, 5, 7, 10],
"max_depth":[3,5,8],
"max_features":["log2","sqrt"],
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
#Calling GridSearchCV
grid_cv = GridSearchCV(estimator=model, param_grid=param_grid, scoring=scorer, cv=5, n_jobs = -1, verbose= 2)
#Fitting parameters in GridSeachCV
grid_cv.fit(X_train,y_train)
print("Best parameters are {} with CV score={}:" .format(grid_cv.best_params_,grid_cv.best_score_))
Fitting 5 folds for each of 150 candidates, totalling 750 fits
Best parameters are {'max_depth': 8, 'max_features': 'log2', 'min_samples_leaf': 1, 'min_samples_split': 2} with CV score=0.9923483230387348:
Wall time: 2min 9s
# building model with best parameters
GB_tuned_grid = GradientBoostingClassifier(
min_samples_split=2,
min_samples_leaf=1,
max_features="log2",
max_depth=8,
random_state=1,
)
# Fit the model on training data
GB_tuned_grid.fit(X_train, y_train)
GradientBoostingClassifier(max_depth=8, max_features='log2', random_state=1)
# Calculating different metrics on train set
GB_grid_train = model_performance_classification_sklearn(
GB_tuned_grid, X_train, y_train
)
print("Training performance:")
GB_grid_train
Training performance:
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 1.000 | 1.000 | 1.000 | 1.000 |
# Calculating different metrics on validation set
GB_grid_val = model_performance_classification_sklearn(GB_tuned_grid, X_val, y_val)
print("Validation performance:")
GB_grid_val
Validation performance:
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.961 | 0.991 | 0.964 | 0.977 |
# creating confusion matrix
confusion_matrix_sklearn(GB_tuned_grid, X_val, y_val)
• Gradient Boost generalized well on the Training and Validation sets and does a good job with predictions.
• The scores aren't too far off from those of the Random Forest.
%%time
#defining model
model = GradientBoostingClassifier(random_state=1)
#Parameter grid to pass in GridSearchCV
param_grid={
"min_samples_split": [2, 3, 5, 10,15],
"min_samples_leaf": [1, 2, 5, 7, 10],
"max_depth":[3,5,8],
"max_features":["log2","sqrt"],
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=model, param_distributions=param_grid, n_jobs = -1, n_iter=50, scoring=scorer, cv=5, random_state=1)
#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train,y_train)
print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Best parameters are {'min_samples_split': 2, 'min_samples_leaf': 1, 'max_features': 'sqrt', 'max_depth': 8} with CV score=0.9923483230387348:
Wall time: 40.9 s
# building model with best parameters
GB_tuned_rand = GradientBoostingClassifier(
min_samples_leaf=1,
min_samples_split=2,
max_features="sqrt",
max_depth=8,
random_state=1,
)
# Fit the model on training data
GB_tuned_rand.fit(X_train, y_train)
GradientBoostingClassifier(max_depth=8, max_features='sqrt', random_state=1)
# Calculating different metrics on train set
GB_random_train = model_performance_classification_sklearn(
GB_tuned_rand, X_train, y_train
)
print("Training performance:")
GB_random_train
Training performance:
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 1.000 | 1.000 | 1.000 | 1.000 |
# Calculating different metrics on validation set
GB_random_val = model_performance_classification_sklearn(GB_tuned_rand, X_val, y_val)
print("Validation performance:")
GB_random_val
Validation performance:
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.961 | 0.991 | 0.964 | 0.977 |
# creating confusion matrix
confusion_matrix_sklearn(GB_tuned_rand, X_val, y_val)
• Gradient Boosting tuned with Random Search matches the Grid Search scores on the validation set.
• It generalized well on both sets; this model shows the most promise so far.
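One way to double-check that the tuned model's train/validation gap is not an artifact of a single split is to cross-validate the tuned estimator itself. A minimal sketch on synthetic data (the dataset, class ratio, and variable names below are stand-ins, not the bank's data):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in with an imbalance similar to the churn data (illustration only)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, weights=[0.84], random_state=1
)
gb = GradientBoostingClassifier(
    min_samples_leaf=1, min_samples_split=2,
    max_features="sqrt", max_depth=8, random_state=1,
)
# Per-fold recall: stable scores across folds suggest the model generalizes
fold_recalls = cross_val_score(gb, X_demo, y_demo, cv=5, scoring="recall")
print(fold_recalls.mean(), fold_recalls.std())
```

A low standard deviation across folds is what supports the "generalized well" claim above.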
%%time
# defining model
model = XGBClassifier(random_state=1, eval_metric="logloss")
# Parameter grid to pass into GridSearchCV
param_grid = {
    "n_estimators": np.arange(50, 150, 50),
    "scale_pos_weight": [2, 5, 10],
    "learning_rate": [0.01, 0.1, 0.2, 0.05],
    "gamma": [0, 1, 3, 5],
    "subsample": [0.8, 0.9, 1],
    "max_depth": np.arange(1, 5, 1),
    "reg_lambda": [5, 10],
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
#Calling GridSearchCV
grid_cv = GridSearchCV(estimator=model, param_grid=param_grid, scoring=scorer, cv=5, n_jobs = -1, verbose= 2)
#Fitting parameters in GridSearchCV
grid_cv.fit(X_train,y_train)
print("Best parameters are {} with CV score={}:" .format(grid_cv.best_params_,grid_cv.best_score_))
Fitting 5 folds for each of 2304 candidates, totalling 11520 fits
Best parameters are {'gamma': 0, 'learning_rate': 0.01, 'max_depth': 1, 'n_estimators': 50, 'reg_lambda': 5, 'scale_pos_weight': 2, 'subsample': 0.8} with CV score=1.0:
Wall time: 25min 41s
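The wall-time gap between the two searches (over 25 minutes for Grid Search versus well under a minute for Random Search) follows directly from the number of model fits each performs. A quick sanity check of the counts reported above:

```python
import numpy as np

# GridSearchCV fits every combination in the grid: the candidate count is the
# product of the parameter list lengths. RandomizedSearchCV samples only n_iter.
param_grid = {
    "n_estimators": [50, 100],           # np.arange(50, 150, 50)
    "scale_pos_weight": [2, 5, 10],
    "learning_rate": [0.01, 0.1, 0.2, 0.05],
    "gamma": [0, 1, 3, 5],
    "subsample": [0.8, 0.9, 1],
    "max_depth": [1, 2, 3, 4],           # np.arange(1, 5, 1)
    "reg_lambda": [5, 10],
}
n_candidates = int(np.prod([len(v) for v in param_grid.values()]))
print(n_candidates)      # 2304 candidates
print(n_candidates * 5)  # 11520 fits with cv=5
print(50 * 5)            # 250 fits for RandomizedSearchCV with n_iter=50
```

So Random Search performs roughly 2% of the fits of the exhaustive grid here, which is why it finishes in seconds.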
# building model with best parameters
XG_tuned_grid = XGBClassifier(
subsample=0.80,
scale_pos_weight=2,
reg_lambda=5,
n_estimators=50,
max_depth=1,
learning_rate=0.01,
gamma=0,
random_state=1,
)
# Fit the model on training data
XG_tuned_grid.fit(X_train, y_train)
[03:05:01] WARNING: C:/Users/Administrator/workspace/xgboost-win64_release_1.4.0/src/learner.cc:1095: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
importance_type='gain', interaction_constraints='',
learning_rate=0.01, max_delta_step=0, max_depth=1,
min_child_weight=1, missing=nan, monotone_constraints='()',
n_estimators=50, n_jobs=8, num_parallel_tree=1, random_state=1,
reg_alpha=0, reg_lambda=5, scale_pos_weight=2, subsample=0.8,
tree_method='exact', validate_parameters=1, verbosity=None)
# Calculating different metrics on train set
XG_grid_train = model_performance_classification_sklearn(
XG_tuned_grid, X_train, y_train
)
print("Training performance:")
XG_grid_train
Training performance:
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.839 | 1.000 | 0.839 | 0.913 |
# Calculating different metrics on validation set
XG_grid_val = model_performance_classification_sklearn(XG_tuned_grid, X_val, y_val)
print("Validation performance:")
XG_grid_val
Validation performance:
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.839 | 1.000 | 0.839 | 0.912 |
# creating confusion matrix
confusion_matrix_sklearn(XG_tuned_grid, X_val, y_val)
• XGBoost with Grid Search generalized well, but its perfect recall comes with a precision of 0.839 (the positive-class share), which suggests it labels nearly every customer as the positive class.
• It does not perform as well as Gradient Boosting.
%%time
# defining model
model = XGBClassifier(random_state=1, eval_metric="logloss")
# Parameter grid to pass into RandomizedSearchCV
param_grid = {
    "n_estimators": np.arange(50, 150, 50),
    "scale_pos_weight": [2, 5, 10],
    "learning_rate": [0.01, 0.1, 0.2, 0.05],
    "gamma": [0, 1, 3, 5],
    "subsample": [0.8, 0.9, 1],
    "max_depth": np.arange(1, 5, 1),
    "reg_lambda": [5, 10],
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=model, param_distributions=param_grid, n_jobs = -1, n_iter=50, scoring=scorer, cv=5, random_state=1)
#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train,y_train)
print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Best parameters are {'subsample': 0.8, 'scale_pos_weight': 2, 'reg_lambda': 10, 'n_estimators': 100, 'max_depth': 1, 'learning_rate': 0.01, 'gamma': 5} with CV score=1.0:
Wall time: 34.2 s
# building model with best parameters
XG_tuned_rand = XGBClassifier(
subsample=0.80,
scale_pos_weight=2,
reg_lambda=10,
n_estimators=100,
max_depth=1,
learning_rate=0.01,
gamma=5,
random_state=1,
)
# Fit the model on training data
XG_tuned_rand.fit(X_train, y_train)
[03:05:36] WARNING: C:/Users/Administrator/workspace/xgboost-win64_release_1.4.0/src/learner.cc:1095: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
colsample_bynode=1, colsample_bytree=1, gamma=5, gpu_id=-1,
importance_type='gain', interaction_constraints='',
learning_rate=0.01, max_delta_step=0, max_depth=1,
min_child_weight=1, missing=nan, monotone_constraints='()',
n_estimators=100, n_jobs=8, num_parallel_tree=1, random_state=1,
reg_alpha=0, reg_lambda=10, scale_pos_weight=2, subsample=0.8,
tree_method='exact', validate_parameters=1, verbosity=None)
# Calculating different metrics on train set
XG_random_train = model_performance_classification_sklearn(
XG_tuned_rand, X_train, y_train
)
print("Training performance:")
XG_random_train
Training performance:
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.839 | 1.000 | 0.839 | 0.913 |
# Calculating different metrics on validation set
XG_random_val = model_performance_classification_sklearn(XG_tuned_rand, X_val, y_val)
print("Validation performance:")
XG_random_val
Validation performance:
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.839 | 1.000 | 0.839 | 0.912 |
# creating confusion matrix
confusion_matrix_sklearn(XG_tuned_rand, X_val, y_val)
• XGBoost with Random Search performs virtually the same as with Grid Search.
• We will not use this model, as Gradient Boosting performs better so far.
# training performance comparison
models_train_comp_df = pd.concat(
[
RandomForest_grid_train.T,
RandomForest_random_train.T,
GB_grid_train.T,
GB_random_train.T,
XG_grid_train.T,
XG_random_train.T,
],
axis=1,
)
models_train_comp_df.columns = [
"RandomForest Tuned with Grid search",
"RandomForest Tuned with Random search",
"GradientBoost Tuned with Grid search",
"GradientBoost Tuned with Random Search",
"Xgboost Tuned with Grid search",
"Xgboost Tuned with Random search",
]
print("Training performance comparison:")
models_train_comp_df
Training performance comparison:
| | RandomForest Tuned with Grid search | RandomForest Tuned with Random search | GradientBoost Tuned with Grid search | GradientBoost Tuned with Random Search | Xgboost Tuned with Grid search | Xgboost Tuned with Random search |
|---|---|---|---|---|---|---|
| Accuracy | 0.839 | 0.839 | 1.000 | 1.000 | 0.839 | 0.839 |
| Recall | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| Precision | 0.839 | 0.839 | 1.000 | 1.000 | 0.839 | 0.839 |
| F1 | 0.913 | 0.913 | 1.000 | 1.000 | 0.913 | 0.913 |
# Validation performance comparison
models_val_comp_df = pd.concat(
[
RandomForest_grid_val.T,
RandomForest_random_val.T,
GB_grid_val.T,
GB_random_val.T,
XG_grid_val.T,
XG_random_val.T,
],
axis=1,
)
models_val_comp_df.columns = [
"RandomForest Tuned with Grid search",
"RandomForest Tuned with Random search",
"GradientBoost Tuned with Grid search",
"GradientBoost Tuned with Random Search",
"Xgboost Tuned with Grid search",
"Xgboost Tuned with Random search",
]
print("Validation performance comparison:")
models_val_comp_df
Validation performance comparison:
| | RandomForest Tuned with Grid search | RandomForest Tuned with Random search | GradientBoost Tuned with Grid search | GradientBoost Tuned with Random Search | Xgboost Tuned with Grid search | Xgboost Tuned with Random search |
|---|---|---|---|---|---|---|
| Accuracy | 0.839 | 0.839 | 0.961 | 0.961 | 0.839 | 0.839 |
| Recall | 1.000 | 1.000 | 0.991 | 0.991 | 1.000 | 1.000 |
| Precision | 0.839 | 0.839 | 0.964 | 0.964 | 0.839 | 0.839 |
| F1 | 0.912 | 0.912 | 0.977 | 0.977 | 0.912 | 0.912 |
# Calculating different metrics on the test set
gradientboost_random_test = model_performance_classification_sklearn(
GB_tuned_rand, X_test, y_test
)
print("Test performance:")
gradientboost_random_test
Test performance:
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.959 | 0.986 | 0.965 | 0.976 |
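For reference, the four metrics reported throughout come straight from scikit-learn; `model_performance_classification_sklearn` is assumed to wrap calls like these. A small worked example with hypothetical labels:

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Hypothetical labels: five positive-class cases; the model misses one of them
# and raises one false alarm
y_true = [1, 1, 1, 1, 0, 0, 0, 1]
y_pred = [1, 1, 1, 0, 0, 1, 0, 1]
print(accuracy_score(y_true, y_pred))   # 6/8 = 0.75
print(recall_score(y_true, y_pred))     # 4/5 = 0.80
print(precision_score(y_true, y_pred))  # 4/5 = 0.80
print(f1_score(y_true, y_pred))         # harmonic mean of the two = 0.80
```

Recall is the metric the searches optimized, since missing a churner (a false negative) is the costliest error for the bank.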
feature_names = X.columns
importances = GB_tuned_rand.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
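Beyond the bar chart, the same importances can be read off as a ranked table. A sketch on synthetic data (the feature names here are made up for illustration; with the real model this would be `pd.Series(GB_tuned_rand.feature_importances_, index=X.columns)`):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic stand-in for the bank's feature matrix (illustration only)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=1)
gb = GradientBoostingClassifier(random_state=1).fit(X_demo, y_demo)
# feature_importances_ is normalized, so the values sum to 1.0
ranked = pd.Series(
    gb.feature_importances_, index=[f"feat_{i}" for i in range(5)]
).sort_values(ascending=False)
print(ranked)
```

Reading the importances as a sorted Series makes it easy to pull the top drivers of churn into the insights section.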
We will create two pipelines: one for numerical columns and one for categorical columns.
For numerical columns, we will impute missing values as pre-processing.
For categorical columns, we will impute missing values and then one-hot encode as pre-processing.
We impute missing values across the whole dataset so that any missing values in future data are handled automatically.
# creating a list of numerical variables
numerical_features = [
"Customer_Age",
"Dependent_count",
"Months_on_book",
"Total_Relationship_Count",
"Months_Inactive_12_mon",
"Contacts_Count_12_mon",
"Credit_Limit",
"Total_Revolving_Bal",
"Avg_Open_To_Buy",
"Total_Amt_Chng_Q4_Q1",
"Total_Trans_Amt",
"Total_Trans_Ct",
"Total_Ct_Chng_Q4_Q1",
"Avg_Utilization_Ratio",
]
# creating a transformer for numerical variables, which will apply simple imputer on the numerical variables
numeric_transformer = Pipeline(steps=[("imputer", SimpleImputer(strategy="median"))])
# creating a transformer for categorical variables, which will first apply simple imputer and
# then do one hot encoding for categorical variables
categorical_transformer = Pipeline(
steps=[
("imputer", SimpleImputer(strategy="most_frequent")),
("onehot", OneHotEncoder(handle_unknown="ignore")),
]
)
# handle_unknown="ignore" lets the encoder handle any category unseen during training
# combining the transformers in a column transformer; only the numerical transformer is
# included here, since the categorical columns are already one-hot encoded with pd.get_dummies
preprocessor = ColumnTransformer(
    transformers=[("num", numeric_transformer, numerical_features)],
    remainder="passthrough",
)
# remainder="passthrough" allows variables present in the original data but not listed
# in "numerical_features" to pass through the column transformer without any changes
# Separating target variable and other variables
X = df.drop(columns="Attrition_Flag")
X = pd.get_dummies(X)
Y = df["Attrition_Flag"]
# Splitting the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
X, Y, test_size=0.30, random_state=1, stratify=Y
)
print(X_train.shape, X_test.shape)
(7088, 36) (3039, 36)
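`stratify=Y` keeps the attrited/existing ratio identical in both splits, which matters with an imbalanced target. A quick sketch on synthetic labels (the 16% positive rate here is a stand-in roughly matching the churn share):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 100 synthetic labels with a 16% positive rate (illustration only)
y = np.array([0] * 84 + [1] * 16)
X_dummy = np.arange(100).reshape(-1, 1)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_dummy, y, test_size=0.30, random_state=1, stratify=y
)
print(y_tr.mean(), y_te.mean())  # both close to 0.16
```

Without stratification a small split could end up with too few churners to train or evaluate on reliably.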
# Creating new pipeline with best parameters
model = Pipeline(
steps=[
("pre", preprocessor),
(
"GB",
GradientBoostingClassifier(
min_samples_leaf=1,
min_samples_split=2,
max_features="sqrt",
max_depth=8,
random_state=1,
),
),
]
)
# Fit the model on training data
model.fit(X_train, y_train)
Pipeline(steps=[('pre',
ColumnTransformer(remainder='passthrough',
transformers=[('num',
Pipeline(steps=[('imputer',
SimpleImputer(strategy='median'))]),
['Customer_Age',
'Dependent_count',
'Months_on_book',
'Total_Relationship_Count',
'Months_Inactive_12_mon',
'Contacts_Count_12_mon',
'Credit_Limit',
'Total_Revolving_Bal',
'Avg_Open_To_Buy',
'Total_Amt_Chng_Q4_Q1',
'Total_Trans_Amt',
'Total_Trans_Ct',
'Total_Ct_Chng_Q4_Q1',
'Avg_Utilization_Ratio'])])),
('GB',
GradientBoostingClassifier(max_depth=8, max_features='sqrt',
random_state=1))])